Reading the Deep Learning Book - Chapter 2

These are my notes and observations from reading the Linear Algebra chapter of the Deep Learning book.

The following notes are presented in order of value to the reader. I start with a discussion of the Moore-Penrose pseudoinverse, followed by a short reflection on "broadcasting".

Moore-Penrose pseudoinverse

The Moore-Penrose pseudoinverse is defined as:

\[ A^{+}:=\lim _{\alpha \searrow 0}\left( A^{T}A+\alpha I\right) ^{-1}A^{T} \]

You can click on the formula to jump into the book. (It's really awesome that all of it is online and you can deep link to pages.)

This formula is mystifying. What is its origin? And why does it do what the book says it does?

We can get a feeling for its origin by looking at the equation \(Ax=b\) that it solves:

\[ \begin{align*} Ax &=b &|& A^{T}\cdot \\ A^{T}Ax &=A^{T}b &|& \left( A^{T}A\right) ^{-1} \cdot \\ \left( A^{T}A\right) ^{-1}A^{T}Ax &=\left( A^{T}A\right) ^{-1}A^{T}b & & \\ x &=\left( A^{T}A\right) ^{-1}A^{T}b & & \end{align*} \]

So we start with the equation that we want to solve and find a more complex form for solving for \(x\) than the usual left-multiplication with the inverse \(A^{-1}\). This is already useful if \(A^T A\) is invertible and \(A\) is not square (because matrix inverses only exist for square matrices).
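
As a quick sanity check, here is a small numerical sketch (assuming numpy; the matrix and vector are made up for illustration). For a tall matrix with full column rank, the formula derived above agrees with numpy's built-in Moore-Penrose pseudoinverse:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])   # 3x2: not square, but full column rank
b = np.array([1.0, 2.0, 3.0])

x = np.linalg.inv(A.T @ A) @ A.T @ b   # the formula derived above
x_pinv = np.linalg.pinv(A) @ b         # numpy's Moore-Penrose pseudoinverse

print(np.allclose(x, x_pinv))  # True
```

Note that np.linalg.pinv also works when \(A^TA\) is singular, which is exactly where the \(\alpha I\) term in the definition comes in.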

This already looks similar to the pseudo-inverse, except for the \(+\alpha I\) term.

Aside: there is also a right pseudo-inverse

We can guess its form by starting with:

\[ \begin{align*} yA&=b&|&\cdot A^{T}\\ yAA^{T}&=bA^{T}&|&\cdot \left( AA^{T}\right) ^{-1}\\ yAA^{T}\left( AA^{T}\right) ^{-1}&=bA^{T}\left( AA^{T}\right) ^{-1}\\ y&=bA^{T}\left( AA^{T}\right) ^{-1}\end{align*} \]

From this, we can make an educated guess:

\[ ^{+}A = A^{T}\lim _{\alpha \searrow 0}\left( AA^{T}+\alpha I\right) ^{-1} \]

Since we cannot always invert \(A^TA\) (respectively \(AA^T\)), an obvious question is:

Why can we invert \(A^TA + \alpha I\) (and likewise \(AA^T + \alpha I\)) for positive \(\alpha\)?

Let's examine this. For a matrix to be invertible, its kernel has to only contain the zero vector:

\[\ker \left( A^{T}A+\alpha I\right) =\left\{ 0\right\}\]

To prove that this is the case for \(\left( A^{T}A+\alpha I\right)\), we need to show that:

\[ \left( A^{T}A+\alpha I\right) v = 0 \implies v = 0\]

So starting with the left side, we can rephrase it as follows:

\[ \begin{align*} & & \left( A^{T}A+\alpha I\right) v&= 0\\ & \Leftrightarrow &A^{T}Av+\alpha Iv&= 0\\ & \Leftrightarrow &A^{T}Av &= -\alpha v \end{align*} \]

If \(v \ne 0\), this would mean that \(-\alpha\) is a negative eigenvalue of \(A^TA\) as \(\alpha\) is \(>0\).

Can \(A^TA\) have negative eigenvalues?

Let's assume \(v \ne 0\) and \(-\alpha\) is a negative eigenvalue, that is \(A^{T}Av = -\alpha v\) holds (and \(\alpha > 0\)). We can left-multiply with \(v^T\) and obtain:

\[ \begin{align*} & v^{T}A^{T}Av=v^{T}\left( -\alpha v\right) \\ \Leftrightarrow \; & \left\| Av\right\| _{2}^{2}=-\alpha v^{T}v=-\alpha \left\| v\right\| _{2}^{2}\\ \Leftrightarrow \; & -\alpha =\dfrac {\left\| Av\right\| _{2}^{2}} {\left\| v\right\| _{2}^{2}}\geq 0 \end{align*} \]

(We can divide by \(\left\| v\right\| _{2}^{2}\) because we assume \(v\ne0\).) This means that \(-\alpha\) has to be \(\geq 0\), which contradicts our initial assumption \(\alpha > 0\) (which implies \(-\alpha < 0\)). So \(-\alpha\) cannot be an eigenvalue after all. In fact, we have just shown that \(A^TA\) can only have non-negative eigenvalues in general. (We can show the same for \(AA^T\) the same way.)

Because \(A^TA\) cannot have negative eigenvalues, \(v\) cannot be \(\ne 0\), so \(v = 0\). This is what we wanted to show: the kernel only contains the zero vector, and \(A^TA + \alpha I\) is invertible for \(\alpha > 0\).
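
We can see this numerically as well. In the following sketch (assuming numpy; the matrix is made up for illustration), \(A\) has linearly dependent columns, so \(A^TA\) is singular, yet \(A^TA + \alpha I\) is invertible for \(\alpha > 0\):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # rank 1: columns are linearly dependent
AtA = A.T @ A

alpha = 1e-3
M = AtA + alpha * np.eye(2)

print(np.linalg.matrix_rank(AtA))   # 1: A^T A alone is not invertible
print(np.linalg.matrix_rank(M))     # 2: the regularized matrix is
print(np.linalg.eigvalsh(AtA))      # eigenvalues are (numerically) non-negative
```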

What if \(A^T A\) was invertible?

For a moment, let's consider the case when \(A^T A\) is invertible: Is the Moore-Penrose pseudoinverse then equal to \(\left (A^T A \right)^{-1} A^T\)?

Matrix inversion is continuous on the space of invertible matrices. You might remember the epsilon-delta definition of continuity from school. A more useful characterization here is that continuity means being able to swap function and limit application:

A function \(f\) is continuous on its domain iff for any convergent sequence \(x_n \to x\) for which \(f \left( x \right )\) is defined \[ \lim_{n \to \infty } f \left ( x_n \right ) = f \left ( \lim_{n \to \infty} x_n \right ) = f \left( x \right).\]

Assuming \(A^T A\) is invertible and using the fact that matrix inversion is continuous for invertible matrices, we now see:

\[ \begin{align*} \lim _{\alpha \searrow 0}\left( A^{T} A +\alpha I\right) ^{-1} A^T &= \left( \lim _{\alpha \searrow 0} A^{T} A +\alpha I\right) ^{-1} A^T \\ &= \left( A^{T}A+0 \, I\right) ^{-1} A^T \\ &= \left( A^{T}A\right) ^{-1} A^T \end{align*} \]

So in this special case, the pseudoinverse is exactly the solution we have come up with ourselves.
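
We can also check this limiting behavior numerically (a sketch assuming numpy; the matrix and vector are made up for illustration): as \(\alpha \searrow 0\), the regularized inverse approaches the pseudoinverse.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])   # A^T A is invertible here
b = np.array([1.0, 2.0, 3.0])
x_pinv = np.linalg.pinv(A) @ b

for alpha in [1e-1, 1e-4, 1e-8]:
    x_a = np.linalg.inv(A.T @ A + alpha * np.eye(2)) @ A.T @ b
    print(alpha, np.linalg.norm(x_a - x_pinv))  # the gap shrinks with alpha
```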

Two properties of the pseudoinverse

There are two properties mentioned in text that are interesting but not obvious:

When \(A\) has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution \(x=A^+y\) with minimal Euclidean norm \(\left\| x \right\|_2\) among all possible solutions. When \(A\) has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the \(x\) for which \(Ax\) is as close as possible to \(y\) in terms of Euclidean norm \(\left\| Ax - y \right\|_2\).

These properties are not obvious and their deduction is enlightening towards the chosen definition of the pseudoinverse, specifically the use of the limit and the constraint of \(\alpha\) to be \(>0\).

To prove the properties, let's look at the following regularized least-squares problem. This is something that is formulated in more detail in Chapter 4 of the book, but it is quite useful here:

\[\min _{x}\left| \left| Ax-b\right| \right| _{2}^{2}+\alpha \left\| x\right\| _{2}^{2}, \, \alpha > 0\]

Let's solve this optimization problem. First, we set up a cost function:

\[ \begin{align*} c_\alpha\left( x\right) &=\left\| Ax-b\right\| _{2}^{2}+\alpha \left\| x\right\| _{2}^{2}\\ & =\left( Ax-b\right) ^{T}\left( Ax-b\right) +\alpha x^{T}x\\ & =x^{T}A^{T}Ax-2b^{T}Ax+b^{T}b+\alpha x^{T} x \end{align*} \]

The first derivative of \(c_\alpha\) is:

\[ \nabla c_\alpha\left( x\right) =2A^{T}Ax-2A^{T}b+2\alpha x\]

And the second derivative is:

\[ H_\alpha \left ( x \right ) = 2A^T A + 2\alpha I\]

\(H_\alpha\) is positive definite, that is \(v^T H_\alpha \left ( x \right ) v > 0\) for all \(v \ne 0\). Why? We have already seen that \(A^T A\) only has non-negative eigenvalues, so it is positive semidefinite by definition, and \(2\alpha I\) is trivially positive definite for \(\alpha > 0\). The sum of a positive semidefinite and a positive definite matrix is positive definite again. Thus \(c_\alpha\) is a strictly convex function. This is along the lines of the one-dimensional case: when the second derivative is \(> 0\) everywhere, the function is strictly convex. A strictly convex quadratic like \(c_\alpha\) has a global minimum, and for strictly convex functions this global minimum is unique. So we know there is exactly one point that minimizes the cost function \(c_\alpha\).

We can determine this global minimum \(x^*_\alpha\) by solving \(\nabla c_\alpha \left ( x \right ) = 0\):

\[ \begin{align*} & & \nabla c_\alpha\left( x\right) &= 0 \\ & \Leftrightarrow & A^{T}Ax-A^{T}b+\alpha x&=0\\ & \Leftrightarrow &\left( A^{T}A+\alpha I\right) x&=A^{T}b\\ & \Leftrightarrow & x&=\left( A^{T}A+\alpha I\right) ^{-1}A^{T}b\end{align*} \]

This is exactly the definition of the pseudoinverse without the limit. So we have found:

\[ x^*_{\alpha} = \underset{x}{\operatorname{argmin}} \left| \left| Ax-b\right| \right| _{2}^{2}+\alpha \left\| x\right\| _{2}^{2} = \left( A^{T}A+\alpha I\right) ^{-1}A^{T}b \]

with \(c_\alpha \left (x^*_\alpha \right ) \leq c_\alpha \left ( x \right )\) for all \(x\), where \(x^*_\alpha\) denotes the minimum point.
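
A numerical sketch of this minimizer (assuming numpy; the data is randomly generated for illustration): the closed-form \(x^*_\alpha\) has the zero gradient we solved for, and any perturbed point has strictly larger cost, as strict convexity demands.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
alpha = 0.1

# the closed-form minimizer derived above
x_star = np.linalg.inv(A.T @ A + alpha * np.eye(3)) @ A.T @ b

def cost(x):
    return np.sum((A @ x - b) ** 2) + alpha * np.sum(x ** 2)

# strict convexity: every perturbed point has strictly larger cost
for _ in range(100):
    x = x_star + rng.standard_normal(3)
    assert cost(x) > cost(x_star)
print("x_star minimizes the regularized cost")
```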

This expression is continuous in \(\alpha\), so we can take the limit \(\alpha \searrow 0\). Remember, we need the constraint of \(\alpha > 0\) for \(A^TA + \alpha I\) to be invertible, so we can only take the limit from above. Taking the limit, we obtain:

\[ x^*= \lim_{\alpha \searrow 0} \left( A^{T}A+\alpha I\right) ^{-1}A^{T}b \]

with \(c \left (x^* \right ) \leq c \left ( x \right )\) for all \(x\) with

\[c \left ( x \right ) = \lim_{\alpha \searrow 0} \left\| Ax-b\right\| _{2}^{2}+\alpha \left\| x\right\| _{2}^{2} = \left\| Ax-b\right\| _{2}^{2}.\]

What do we gain from this? Well, for one, we now know that this expression minimizes \(\left| \left| Ax-b\right| \right| _{2}^{2}\). Furthermore, if there is a solution to \(A x =b\), we can use a similar approach to see that the solution \(x^*\) is no larger in Euclidean norm than any other solution \(\hat x\) of \(Ax=b\).

We do this in two steps. First, we observe that

\[A\hat x = b \Leftrightarrow A \hat x - b = 0 \Leftrightarrow \left\| A\widehat {x}-b\right\| _{2}^{2} = 0 \]

and, because \(x^*_\alpha\) minimizes \(c_\alpha\),

\[ \begin{align*} & & c_{\alpha}\left( x_{\alpha }^{*}\right) &\leq c_{\alpha }\left( \hat {x}\right) \\ & \Leftrightarrow & c_{\alpha}\left( x_{\alpha }^{*}\right) &\leq \left\| A\hat {x}-b\right\| _{2}^{2}+\alpha \left\| \hat {x}\right\| _{2}^{2}=0+\alpha \left\| \hat x\right\| _{2}^{2} \\ & \Leftrightarrow & c_{\alpha}\left( x_{\alpha }^{*}\right) & \leq \alpha \left\| \hat x\right\| _{2}^{2} \end{align*} \]

This expression is again continuous, so we can take the limit \(\alpha \searrow 0\), and we see:

\[\begin{align*} & c \left( x^{*} \right) \leq 0 \\ \Leftrightarrow & \left\| Ax^{*}-b\right\| _{2}^{2} \le 0 \end{align*} \]

Because norms are always non-negative, we have \(0 \le \left\| Ax^{*}-b\right\| _{2}^{2} \le 0\), so \(\left\| Ax^{*}-b\right\| _{2}^{2} = 0\). And we have observed above that this is equivalent to \(Ax^{*} = b\). So if there is at least one exact solution to the problem, we are sure to obtain an exact one, too. To be fair, we could have deduced this in the previous section. However, we can take another limit on the inequality \(c_{\alpha}\left( x_{\alpha }^{*}\right) \leq \alpha \left\| \hat x\right\| _{2}^{2}\) and obtain a more interesting result. This time, we only take the limit \(\alpha \searrow 0\) of \(x^{*}_\alpha\), but keep \(c_\alpha\) fixed:

\[ \begin{align*} &&c_{\alpha}\left( x_{\alpha }^{*}\right) &\leq \alpha \left\| \hat x\right\| _{2}^{2} \\ \Rightarrow && c_{\alpha}\left( \lim_{\alpha \searrow 0} x_{\alpha }^{*}\right) &\leq \alpha \left\| \hat x\right\| _{2}^{2} \\ \Leftrightarrow && c_{\alpha}\left( x^{*} \right) &\leq \alpha \left\| \hat x\right\| _{2}^{2} \\ \Leftrightarrow && \left\| Ax^{*}-b\right\| _{2}^{2}+\alpha \left\| x^{*}\right\| _{2}^{2} &\leq \alpha \left\| \hat x\right\| _{2}^{2} \\ \Rightarrow && \alpha \left\| x^{*}\right\| _{2}^{2} &\leq \alpha \left\| \hat x\right\| _{2}^{2} \\ \Leftrightarrow && \left\| x^{*}\right\| _{2}^{2} &\leq \left\| \hat x\right\| _{2}^{2} \end{align*} \]

Here, we use \(\left\| Ax^{*}-b\right\| _{2}^{2} \ge 0\) to be able to drop the term and \(\alpha > 0\) to preserve the direction of the inequality, and we obtain the second property about the length of \(\left \| x^{*} \right \|\).

Now we have proved both properties that were mentioned in the book. On the one hand, if there are solutions, we will obtain one with minimal norm. On the other hand, if there are no solutions, using the pseudoinverse will provide us with an \(x\) that at least minimizes \(\left| \left| Ax-b\right| \right| _{2}^{2}\).
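
The minimum-norm property can be illustrated numerically (a sketch assuming numpy; the matrix, vector, and kernel element are made up for illustration). For a wide \(A\), adding any kernel vector to a solution gives another solution, but the pseudoinverse solution has the smallest norm:

```python
import numpy as np

# Wide A (more columns than rows): infinitely many solutions exist.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])   # 2x3, rank 2
b = np.array([1.0, 2.0])

x_star = np.linalg.pinv(A) @ b
print(np.allclose(A @ x_star, b))   # an exact solution is found

# shifting by a kernel vector gives another solution with a larger norm
n = np.array([1.0, -1.0, 1.0])      # A @ n = 0
x_hat = x_star + n
print(np.allclose(A @ x_hat, b))
print(np.linalg.norm(x_star) < np.linalg.norm(x_hat))  # minimal norm wins
```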

To convince yourself that the expressions above are indeed continuous, note that we can rewrite \(c_\alpha \left (x \right )\) as \(c \left (\alpha, x \right )\), a continuous function of two variables. This avoids the functional-analysis route of treating \(c_\alpha\) as a convergent sequence of functions.

Broadcasting notation: oh my...

The broadcasting notation in the Deep Learning book is weird. For a moment, let's ignore its origins in numpy, and let's look at what's happening.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: \(C=A+b\), where \(C_{i,j}=A_{i,j}+b_j\). In other words, the vector \(b\) is added to each row of the matrix. This shorthand eliminates the need to define a matrix with \(b\) copied into each row before doing the addition. This implicit copying of \(b\) to many locations is called broadcasting.

Let's say \(A \in \mathbb{R}^{3 \times 3}\) and \(b \in \mathbb{R}^3\), for example:

\[ A = \left[\begin{matrix}0 & 0 & 0\\1 & 1 & 1\\2 & 2 & 2\end{matrix}\right] , b = \left[\begin{matrix}1\\2\\3\end{matrix}\right] \]

Then, with broadcasting:

\[ A + b = \left[\begin{matrix}1 & 2 & 3\\2 & 3 & 4\\3 & 4 & 5\end{matrix}\right] \]

How do we get there? Essentially, we take the vector \(b \in \mathbb{R}^3\), interpret it as a row vector, and add it to every row of the matrix, so:

\[ A + b= A + \left[\begin{matrix}1\\1\\1\end{matrix}\right] b^T \]
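A quick numpy check of this identity (a sketch; numpy is the natural reference here, since the notation originates from it):

```python
import numpy as np

A = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# numpy broadcasting adds b as a row vector to every row of A
C = A + b
print(C)

# the same result via the explicit rank-1 construction from the text
ones = np.ones((3, 1))
C2 = A + ones @ b.reshape(1, 3)
print(np.allclose(C, C2))  # True
```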

Wouldn't it make more sense to write this as \(A + b^T\)? The reason for the unintuitive notation lies in the details of broadcasting: whereas in maths a vector in \(\mathbb{R}^3\) is identified with a column matrix in \(\mathbb{R}^{3 \times 1}\), in the context of broadcasting it is treated as a row matrix in \(\mathbb{R}^{1 \times 3}\). This row matrix is then repeated across the \(1\) dimension to make it into a \(3 \times 3\) matrix that can be added to \(A\). However, for matrix multiplications, \(b\) continues to be treated as a column matrix. This is really confusing. So what's the reason for this ambiguity?

My best guess is that in ML contexts, data entries (examples) are often treated as row vectors. For example, a design matrix stores a number of examples, each in a separate row. So if you want to change the bias of your examples, you want to add a bias vector as a row vector to each row. Indeed, in chapter 8, broadcasting is used to express batch normalization in eq. (8.35).

I don't think this excuses the lack of clarity introduced with this notation, but then again it is too late to change and make everybody use a transposed design matrix... :)

Looking back and ahead

There is much more I could write about, but I think these were the most interesting bits from my notes. We have revisited convex functions, continuity and limits to motivate the curious definition of the Moore-Penrose pseudoinverse. Last but not least, we have looked at the origin of the broadcasting notation and why it might be confusing to someone who is new to ML. Next up: reading Chapter 3 and thinking about probability theory.

Stay tuned,

PS: Thanks to my friend Armin Krupp for suggestions and corrections of the draft. All current mistakes and factual errors have been added later.

Recent Medium Posts

I have published a couple of posts on Medium to see how it works:

A Dart REPL PoC - Hacking with Dart

A well-received post about a PoC interactive shell for Dart. I hope I've been able to restore some karma points with the Dart team with this #butrisesagainharderandstronger.

Metamath - Formalizing math

A small write-up about my impressions from learning about Metamath and some questions that I really, really want to find time to look into.

A shoddy implementation in Dart can be found here.

Reading the Deep Learning Book - Chapter 1

As part of a long study break, I've started reading the Deep Learning book. I will try to write down my thoughts for each chapter over time as a way of engaging with the material.

Stay tuned for more posts and also cross-overs.

I'm still trying to decide whether to post the more mathy stuff here.


Markdown Blog

It's been a while, but now my blog is up using a static HTML generator with Markdown as the underlying format \o/ The static HTML is generated by a couple of small Go scripts that wrap pandoc and Go's text/template. Converting all of my WordPress posts to Markdown was as fun as such a task can be, and quite tedious. However, it was still nice to read everything I have ever posted again, and it allowed me to take a more holistic view of everything I've done. It has definitely helped me with some life decisions this year.

I don't think this is the end of my efforts to improve this blog: I really should move off the Go scripts at some point, the content menu is very mobile-first (it only looks and works well on my phone screen), and source highlighting does not work yet.

The biggest positive take-away, apart from finishing this mini-project, was that committing all intermediate outputs to a repository during the conversion really helped with moving fast. Every time I changed something in the scripts or in the Markdown code, I'd see how it trickles down. With disciplined reviewing, this allowed me to avoid random breakages, and it made debugging quite easy and cheap because SmartGit's differ would do most of the work for me.

I used the opportunity to learn and write some simple Go code. I am split on the language. It feels crude and lacking in abstractions. Also, structural typing makes it harder to understand and explore code. On the other hand, I have been told all of this is intentional to keep things simple. Indeed, it was always straightforward to get the code to do what I want, with few surprises, even though it usually did not feel particularly elegant.

My time estimate for the conversion is that I've spent about 1 week for research, 1 week to convert the posts and 1 week to learn go and write the scripts. All of this was spread out over about 2 years with bursts of work now and then.

I'm not sure what the next steps are for this blog. It feels good to have it locked down in HTML and Markdown in a clean version that is easier to back up. I want to play around with other ways of publishing and organizing information. This blog can be a way to centralize all my efforts and keep pointers and write-ups.

Stay tuned! :-)

Writer's block


I don't know what to write... to write is to create though. So let's create:

I've been busy for the last 2 years. Busy with work and busy living. Life always takes you to different places, and for me this was Zurich (from Munich) and Google (from university). Both changes were not that big and yet quite big. They have taken me far away from where I imagined myself to be at this moment in time 2.5 years ago. I was at a loss for words about this for a long time, but I think I'm finding my voice again. My life at Google is worth a series of articles at a later time, and Zurich is a whole different story with very little computer science in my spare time.

So let's go back to the topic: writer's block. How to overcome it? In a way, I ran out of creative steam during my master's thesis, and even though I am busy at Google, it's a different kind of busyness to work in a big company and be told what to do. Definitely more boring than university usually :) In a way, my writer's block, which has kept this blog on hold for the last 24 months, has also been a creator's block. I have resolved to change this with a willingness to forego some sleep in the months to come and to show more discipline and restraint in my outgoing nature ;)

My writer's block in the last year is also a consequence of my reluctance to produce content for the WordPress platform I'm using, which has proven quite painful to maintain over time. The plugins and hand-written glue that keeps some of the blog together is not holding as well as it used to anymore, and I'm not able to fix all the issues in my spare time.

It's more fun to create than to maintain in general.

In this spirit, I want to come up with a new blog platform to run this blog on. Since nothing out there fits my peculiar taste, I will have to come up with something new from old parts. A Frankensteinian blogging system. My main idea is to use unit testing and regression testing on the components and content of the blog to identify issues and maintain consistent quality overall. In particular, existing content should stay static and not break in unexpected ways when new, supposedly orthogonal content is added.

Imitating PBRT-style literate programming in LaTeX

Today, I want to release another bit of code from my master's thesis. This time it won't be C++ code; instead, I'm going to release some LaTeX code that I used to display source code fragments.

Specifically the results look like this:


You can download the accompanying example PDF here.

This mirrors the style of PBRT, which uses literate programming to develop and explain an advanced raytracer. Unlike PBRT, it doesn't require you to write source code that adheres to literate programming principles. I used it to take parts of the source code of my master's thesis and explain them.

The following LaTeX code is used to create the fragment above:

\defTaggedCurrentFileFragment<Another file fragment;recursive>+=

You can either inline code fragments using a special listing environment, or you can reference code fragments in an external file.

//<Another file fragment;recursive> =
int fac_r( int n ) {
    if( n <= 1 ) {
        return 1;
    }
    return n * fac_r( n - 1 );
}

Cross-referencing is allowed as well and forward page references are created just like in PBRT. However, I have not figured out how to create backward references.

You can find the LaTeX code on GitHub. I have not wrapped it into a LaTeX package because I don't consider it mature enough, but maybe someone finds it useful, and it eventually ends up in a package (that would be neat :)).

To close, I want to say that the book More Math into LaTeX1 has been very useful while creating these LaTeX macros. It is the first book I have found that covers creating custom macros and commands understandably and in depth. It feels good to finally understand a bit more about how you can customize LaTeX to fit your needs :)


  1. See also

Creative learning, Scratch and alternatives for learning how to create games

I'm still working at university as a student researcher for the chair of software engineering in a very unconventional placement: I help prepare and teach courses about topics in computer science for pupils, usually aged 12-16.

My university1 actually invests a lot of resources in promoting engineering degrees and applied sciences because the schools cannot. Schools here have only so many resources, and the computer science and mathematics curricula are mostly boring and uninteresting. For me, computer science is a lot about being creative and about identifying and solving problems. I'm not sure I would have been interested in computer science that much if it had been just another subject like biology or geography.

The courses we give are usually just one or two hours long for groups that visit our faculty, but we also have courses that last a whole day or two and cover topics in more detail. We usually choose topics that are more interesting than what you might learn at school and that show that computer science is not just about typing in code for hours without end but consists of analyzing real-world problems, devising algorithms and tinkering.

Most pupils already know a bit about computer science, and some even know how to code a bit with Java, but they are always positively surprised when we work out together how to get out of a labyrinth2, how to come up with sorting algorithms, or how to encrypt messages easily with some binary magic3.

Right now I am tasked with updating and recreating a two-day course about creating games that was held twice in recent years. The original course was intended only for girls with no prior coding experience and used Scratch as its development tool.

The updated version should be a bit more advanced and introduce more elements of game development properly to be both fun and educational. For this, I examined Scratch and alternatives to determine how to proceed. It ended up consuming a whole day and when I was done I had a beautiful mind map covering quite a number of different tools and frameworks. Having some spare time, I decided to write it all up in a blog post :)


Scratch is a programming language with an IDE that is targeted at young people who have never coded before. It has been developed by the Lifelong Kindergarten group at the MIT Media Lab.


Coincidentally, I'm participating in an online lecture they are offering right now: Learning creative learning. It is free and quite interesting---and Scratch is mentioned and discussed there as well.

Scratch's user interface is very simple and intuitive to use. You can code programs without having to type code. Everything is based on blocks that can be dragged and dropped to create event-based behaviors. Users can draw their own sprites and animate them or easily import their own pictures and use them. Scratch has an active online community and it is easy to share your projects with others and get feedback, which encourages participation.

However, a more experienced user quickly finds many limitations; for example, it is not possible:

  • to create procedures, i.e. custom blocks, and the message-sending blocks are too primitive to create really complex behaviors;
  • to rename variables, which makes refactoring impossible;
  • to change the sequence of statements easily; or
  • to create new sprites (game objects) dynamically.

Moreover, I already found a Heisenbug: I could not write a simple lock that keeps an event script from being run multiple times at run-time---it only works as described in the docs during step debugging (at reduced script execution speed).

Nonetheless, I consider it a very valuable tool that can be used to ease children into coding. The current 1.4 release needs to be installed on your computer, but they are working on a 2.0 release that will run entirely in the browser. You can give its beta a try here.

Creating programs using drag and drop is a very nice idea and I really appreciate Scratch's user interface. And I'm not the only one: Scratch has spawned many spin-offs that expand on its design. This has been possible because its source code has been released. You can find it here.

I have also examined some Scratch spin-offs that I have found:


Snap (formerly called BYOB for Build Your Own Block) extends Scratch and adds support for custom blocks, first-class lists and procedures, and continuations. It becomes a lot more like a functional language this way. You can run it here. The available documentation explains the new concepts very well. A German translation is available for download, too. The last release is from 2011, so it might not be actively maintained anymore.


StarLogo TNG

StarLogo TNG extends Scratch into 3D and has been developed at the MIT as well. It is targeted at multi-agent systems and older users. It is more complex and there are some interesting tutorials and workshops available online: one shows how to build an ant population that knows how to pick up objects; and another one shows how to develop an epidemic model.

MIT App Inventor 

MIT's App Inventor lets you build mobile apps easily using a Scratch-based programming interface. It supports a WYSIWYG UI editor as well. Apparently, this used to be Google App Inventor.


GameFroot's editor looks pretty polished. It is browser-based, and it is very simple to create a side-scrolling platformer out of the box. However, scripts are not that important and most behavior is predefined. So it's probably not a good tool to tinker with. The games that have been created with it naturally look very similar.



Stencyl is another game-creating tool, but it looks very mature and there is a pro version that is sold for 80-150 USD per year to publish your games to mobile devices and computers. The normal version only supports browser games. It is updated regularly but there is no localization support (yet).


The documentation is fairly complete, and it supports writing behavior scripts by using a Scratch-like interface, which supports custom procedures as well, or by writing JavaScript code.

Lots of different games have been created with Stencyl. Here are some which I've tried and which are really fun to play:

  • The Little Who

    The Little Who

  • Balls in Space

    Balls in Space

  • The Wish

    The Wish

I've also looked at IDEs that are not based on Scratch.


Etoys is similar to Scratch but not as polished. However, it has been translated to many different languages. It also lacks the social community features that are so prominent in Scratch. Moreover, you cannot run projects in the browser straight away because a special plugin has to be installed first. All of this probably hinders its adoption.

When you actually give it a go, you will be surprised by how nice it is. It feels very different to Scratch because it uses a unified workspace. There is no special script plane like in Scratch. Everything is part of the workspace. For example, when you look through the tutorials, you not only see an elastic ball bouncing, but you also see the special workspace frame that contains all different animation frames and the script code to assign them to the ball one after the other to emulate its deformation. This way you don't have to imagine anything. It's all there in the open for you to see---and debug if necessary.

Sadly, there is not much documentation and for me it is not as intuitive as Scratch.


Alice is similar to Scratch and Etoys, except that Alice uses 3D scenes and is geared towards storytelling. It's quite big at over 600 MB, but for this you get quality models to play around with. I can't say much about its features. There is not much documentation available online. You probably need to buy their book "Learning to Program with Alice" to really learn more about it. I do not like this. However, there is some free stuff available. It is somewhat hidden, but here is a link.

Again, you only code using drag and drop, but you can switch to a Java mode which makes the blocks look like their equivalent Java statements.


Greenfoot can be used to teach programming with Java in a game-oriented context. Programming is done in a normal text editor. Greenfoot displays a class hierarchy and allows you to place instances of actors in a 2D scene. You can then control them manually or subclass actors and override their methods.

It's fairly basic but probably a good introduction to Java programming. If you start out with Scratch or Alice, it should be a small step from drag and drop coding to "real" coding.

Microsoft SmallBasic

I have learnt programming using good ol' QBASIC on my parents' 386, so I wouldn't disagree with a Basic dialect as a first language. Microsoft gave it some love, and it looks polished. I'm not sure how many people are using it because there is no social community linked to it and the last release is from 2011. There is no specific support for creating games, and the language is quite limited.

Microsoft TouchDevelop

This is really cool. Microsoft Research has developed a fully-fledged scripting language for mobile devices, so you can code using a touch-based interface. You can read more about it in the free online book "touchdevelop - programming on the go". Of course, it also runs in your browser and you can try it yourself :)

I don't think it is the best choice for creating games, but you can certainly learn how to program with it by creating "serious" apps for your mobile device. Maybe this makes it interesting for adult beginners to programming.

That's it for today.


  1. Technische Universität München,

  2. Maze solving algorithm

  3. Xor cypher

Al et WML

First, I have created some new pages about old university projects. Among them are a condensed page about light propagation volumes, which also made me update the project files to Visual Studio 2012, a page about my bachelor's thesis in mathematics (Discrete Elastic Rods) and a page about my master's thesis in computer science (Assisted Object Placement). I have not written about the latter two subjects before. Maybe I'll talk some more about them later and write a full post-mortem on them.

This is the first of a number of posts that will be related to my master thesis, or rather code drops from its code base. I have written about 60k LoC in the 6 months of my master thesis and there are a few bits that might be useful in the future.

The first thing I want to talk about is a very simple file format I came up with. Devising new text file formats is not something I have been keen on lately, especially since so many already exist. However, I have found none that really fits my requirements:

  • minimal clutter (preferably indentation-based),
  • support for raw text inclusion, and
  • good C++ support.

JSON has too much clutter and doesn't support raw text. YAML, on the other hand, sounds like the perfect choice, even though it's not that easy to find a good library for it. However, when it comes to raw text, you run into the issue that tab characters are never allowed as indentation. Moreover, I was not happy with the API choices and ran into bugs in the libraries I tried to use.

So I decided to develop a very simple text-based data storage format:

WML - Whitespace Markup Language

I'll just start with an example, ie my readme. You can find the full readme here.

'Whitespace Markup Language':

        * very simple
        * no clutter while writing
        * only indentation counts
        * empty lines have no meaning
        * embedding text is easy
        * everything is a map internally


        title       "test\t\t1"
        path        'c:\unescaped.txt'
        version     1

            unformated text

            newlines count here

            time-changed    10:47am
            flags   archive system hidden

                    some data

                    this is nested too

                    key names
                    dont have to
                    be unique (see stream)
                        users   andreas root

As you can see, it is a whitespace-based format inspired by Python: indentation is used to convey structure. A WML file represents a map structure: every node has a name and possibly multiple children. A node definition in a WML file takes one of three forms:

  • the node name followed by the names of several children (which won't have any children themselves),
  • the node name followed by one colon, signifying that a nested definition follows (similar to JSON), or
  • the node name followed by two colons, signifying a raw text block.

Two kinds of strings are supported: single-quoted raw strings that do not interpret escape sequences, and double-quoted C strings that support escape sequences.
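To make the three node-definition forms concrete, here is a small made-up snippet; the key names are invented for illustration (indentation is shown with spaces for readability, although the grammar below only allows tabs):

```
window:
    size    800 600
    title   "my app"
notes::
    This raw text block is stored verbatim,
    including its line breaks.
```

Here `size 800 600` is an inline entry whose children are leaves, `window:` opens a nested map, and `notes::` starts a raw text block.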

All in all, the grammar is very simple:

    INDENT, DEINDENT are virtual tokens that control the indentation level
    NEWLINE is a line break

    Indentation is done with tabs only at the moment.

    Here is a rough EBNF syntax for WML:

    root: map

    value: identifier | unescaped_string | escaped_string

    identifier: (!whitespace)+
    unescaped_string: '\'' (!'\'')* '\''
    escaped_string: '"' (!"\"")* '"' with support for \t, \n, \\, \', and \"

    key: value

    map: map_entry*

    map_entry: inline_entry | block_entry

    inline_entry: key value+ NEWLINE
    block_entry: key ':' ( ':' NEWLINE INDENT textblock DEINDENT | NEWLINE INDENT non-empty map DEINDENT )

    This file is itself a WML file and root["Whitespace Markup Language"]["Example"].data() is the example WML node

I've used this format for custom shaders as well as for my settings files and the declaration files of my test scenes:

        name 'platform, brown'
        size 20 2 20
        webColor 945412
        name 'platform, green'
        size 20 2 20
        webColor 129429
        name 'platform, muddy blue'
        size 20 2 20
        webColor 6488a5
        name 'platform, light blue'
        size 20 2 20
        webColor e5eaf1

I've uploaded the current code for WML to GitHub; you can find it here. The API supports an overloaded index operator to access the children of a node and contains both a parser and an emitter for WML.


First, the current API isn't brilliant. It would be nice to separate the data model from the parser and emitter by using templates and type traits to improve abstraction. I think I might go and investigate different API types in the future and see which one works best for some simple cases.

Second, it would be possible to reduce clutter even more and remove the need for single colons to denote nested maps. An even simpler format could look like this:

    nodeA
        nodeB nodeC nodeD
        nodeE nodeF
        nodeG
            raw data
                with fixed indentation

        nodeG would be interpreted as "raw data\n\twith fixed indentation\n..."

This would yield the following JSON-equivalent:

{ "nodeA": { "nodeB": {} , "nodeC": {} , "nodeD": {}, "nodeE": {}, "nodeF": {}, "nodeG": { "raw data..." : {} } } }

This would still separate raw text from normal data: a node that contains raw text can never contain other children this way. However, I cannot think of a good way to lift that restriction without introducing a special character to end a raw text block.

That's it for now :)

Unity Prototype Project

For the last two days I've tried out the Unity engine with a friend from university, Andreas Ostermaier. If you don't know the Unity engine, go and check it out.

Its design is very similar to Torque 2D's. I can't tell who had it first, but Unity's implementation is more mature than what I remember from Torque 2D 1.3 back when I was still using it in 2007. A scene is made up of game objects. A game object is a container of components (also called behaviors)1. A game object always contains a Transform component, which places it in the scene (and in the scene hierarchy). Usually, it also contains a Mesh component and a Physics component for handling rendering and collision detection. But sometimes empty game objects are useful as well: as respawn points, for example. Custom scripts, written in JavaScript or C#, can be wrapped in a Script component and tied to game objects in the same way.

Templates/prototype objects2, called prefabs, are used to avoid creating all game objects by hand. Instances of a prefab link back to the prefab, that is, the original object, and automatically inherit changes made to it (or to the children in its hierarchy). Torque 2D 1.3 lacked this feature: GarageGames had an editor plugin in development when I was doing contract work for them, but I'm not sure it was ever released.

We have spent the last two days implementing a PoC for a game idea my friend came up with: being able to switch between 2D and 3D in a platformer to solve puzzles. It is about reinterpreting a level when it is seen from a 2D perspective. For example, two separate tiles above an acid pool that cannot be reached by jumping from one to the other can suddenly be connected by switching to 2D, because they look connected.

[Screenshots: PoCDimRunner in 3D and in 2D]

You can play the PoC build here, using Unity's web player, or download it here as a .zip archive. Use the left mouse button or 'c' to switch between 2D and 3D.

There are a few games that are similar to this idea: Echochrome, Super Paper Mario, and Crush (and its sequel Crush 3D).

You can find the code on GitHub (ie here).


  1. See eg here, here, and here for an overview.

  2. See [2]

Presentation (seminar) about convex analysis

Last year (yes, I'm really slow with writing things up "lately") I gave a presentation about convex analysis as part of a seminar about inverse problems in imaging.

I want to write about it today.

The seminar was based on the book 'Handbook of Mathematical Methods in Imaging' (Springer Reference). It's very expensive and, in my opinion, pretty useless for learning any details. At least the chapter about duality and convex programming was so dense that you could not learn anything from it without consulting other books.

The presentation was supposed to give an overview of duality and convex analysis and serve as an introduction. Summarizing 42 very concise pages in 45 minutes is impossible, so I had to choose a few topics that convey the important concepts and go into detail there.

The book itself does not contain many proofs, which makes it hard to follow. As a mathematician, I like proofs. They usually help clear things up, and good ones give valuable insight into the methods of a theory.

But proofs suck when you see them on PowerPoint slides. They also suck when you're the presenter.

So I gave a blackboard presentation, using a few slides only to show some pictures and sketches that I could not draw well enough on a blackboard.

To prepare the presentation, I used OneNote and my Bamboo tablet. I scribbled everything down over a weekend in one OneNote notebook and later extracted some sketches into another notebook, which I used instead of a PowerPoint presentation. I exported the original notes as a PDF and used it as a handout after the presentation.

The presentation went well. I had fun writing everything down on the blackboard and developing the subset of the theory I wanted to present, and the audience was content with it, too: it moved at just the right speed and felt like the lectures everybody was used to.

In retrospect, using OneNote and a tablet to create a blackboard presentation was the right decision. I don't want to think of the time I would have lost had I created everything with LaTeX or PowerPoint.

Long story short: here are my notes :)

Cheers, Andreas

Molecular Dynamics and CUDA---my interdisciplinary project

You have to do an interdisciplinary project (IDP) for your Master's degree at the Technische Universität München. I decided to do mine at the Chair for Scientific Computing.

My topic was the Efficient Implementation of Multi-Center Potential Models in Molecular Dynamics with CUDA. For this, I have parallelized the molecule interaction calculations in the molecular dynamics simulator MarDyn with CUDA, optimized the code, and added support for more advanced potential models on the GPU.

What is molecular dynamics?

Molecular dynamics deals with the simulation of atoms and molecules. It approximates the interactions between a large number of molecules and atoms using force fields, which are modeled with equations from quantum mechanics.

Such simulations are important for many areas of biology, chemistry, and materials science. For example, you can examine the folding of proteins or the nanoscopic behavior of materials. The simulations are sped up using sophisticated approximations and algorithms.

Molecules only interact strongly with molecules that are nearby. One of the first approximations is to take only these strong interactions into account and ignore the weaker long-distance ones. This spatial locality of the calculations leads to the linked-cell algorithm: it divides the simulation space into a grid of cells, and only interactions between molecules within the same cell and with nearby cells are calculated.

Molecules are composed of atoms, which interact according to different physical principles. This is approximated by giving a molecule different sites that use different potential models. The most common and simplest potential model is the Lennard-Jones potential.

What is MarDyn?

MarDyn is a molecular dynamics simulator that has been developed at the Technische Universität München. See Martin Buchholz's thesis for more information.

Its code is rather ugly and features some crazy design decisions. I have been told that the people who initially developed it were learning how to code in C++ at the same time---and it shows :(

Previous work

When I started, there was a 'working' OpenCL implementation of Lennard-Jones potentials on the GPU for molecules with only one site. It was the result of somebody else's IDP. However, it was not very useful: the code was crammed into one function, and one-letter variable names were used throughout.

The main aims of the project

The idea was to port the previous OpenCL implementation to CUDA and continue from there to add support for multiple sites with different potential models.

Chronology of my work

Porting the code to CUDA was a straightforward API change. However, I found the original code impossible to read, maintain, and optimize. The logic behind it was neither clear nor well explained.

Consequently, my supervisor and I decided that I would rewrite everything and optimize it from the start. This took longer than expected, and while the resulting parallelism and the logic behind it were clear in the code, code complexity became an issue. The code consisted of three helper functions and two kernel functions in two files. When I went on to integrate support for multiple sites and potential models, it became clear that the design would not endure further feature additions and lacked the flexibility for quick changes.

Treating the old implementation as a prototype, and now aware of many possible pitfalls, I set out and rewrote everything again. This time the focus was on modularity and separation of concerns instead of performance: code architecture was the most important thing. I scaffolded the new version around the old code, which was working correctly and already optimized, and embedded it into the new design step by step. Afterwards, I optimized the code.

What I've done


  • worked with CUDA 1.x and 2.x,
  • used both the driver and the runtime API,
  • implemented a template-based, very modular code design,
  • tried lots of optimizations, and
  • learnt how to read PTX to find workarounds for compiler bugs.

Sadly, some of the optimizations didn't improve performance noticeably. I think the monolithic, all-in-one approach I used is the main issue.

My code uses the driver API because it makes it easy to dump all CUDA calls and replay them later for debugging purposes.

If I could iterate over the code one more time, I'd try to use many small kernels instead of only two rather large ones. Currently, the kernels sequentially calculate all the potential models needed for each site of a molecule pair. This makes it difficult to measure performance improvements and isolate bottlenecks, and I suspect it's slower than it has to be.

I'd also create a sandbox application to test and develop the kernels instead of using MarDyn as a starting point, because MarDyn is unnecessarily huge and is not really needed to verify whether the CUDA code works.

Final paper & code

You can find the final paper that contains a description of my code and the performance results here, and my code here (I cannot include MarDyn's code because its source is closed, but at least I can share my code).


I just want to share two more links:

  • HOOMD-blue is a molecular dynamics framework that also uses CUDA, and its code design looks like what I probably should have done :)
  • VMD is a molecular visualization application, which looks promising, too