You have to do a interdisciplinary project for your Master's degree at the Technische Universität München. I decided to do it at the Chair for Scientific Computing.
My topic was the Efficient Implementation of Multi-Center Potential Models in Molecular Dynamics with CUDA. For this, I have parallelized the molecule interaction calculations in the molecular dynamics simulator MarDyn with CUDA, optimized the code, and added support for more advanced potential models on the GPU.
Molecular dynamics deals with the simulation of atoms and molecules. It estimates the interaction between a high number of molecules and atoms using force fields which are modeled using quantum mechanics equations.
Such simulations are important for many different areas in biology, chemistry and material sciences. For example you can examine the folding of proteins or the nanoscopic behavior of materials. The simulations are sped up using sophisticated approximations and algorithms.
Molecules only have strong force interactions with molecules that are nearby. One of the first approximations is to take into account these strong interactions only and ignore weaker long-distance ones. This space locality of the calculations leads to the Linked Cell-algorithm. It divides the simulation space into a grid of cells and only interactions between molecules inside each cell and with nearby cells are calculated.
Molecules are composed of atoms which interact using different physical principles. This is approximated by having different sites in a molecule that use different potential models. The most common and simplest potential model is the Lennard-Jones potential.
MarDyn is a molecular dynamics simulator that has been developed at the Technische Universität München. See Martin Buchholz's thesis for more information.
Its code is rather ugly and it features some crazy design decisions. I have been told that the people who were initially developing it where learning how to code in C++ at the same time---and it shows :(
When I started, there was a 'working' implementation of Lennard-Jones potentials for molecules with only one site on the GPU. It used OpenCL and was the result from somebody else's IDP. However, it was not very useful: the code was crammed into one function and one letter variable names were used through-out.
The idea was to port the previous OpenCL implementation to CUDA and continue from there to add support for multiple sites with different potential models.
Porting the code to CUDA was a straight-forward API change. However, I found the original code to be impossible to read, maintain, and optimize. The logic behind the code was not clear or well explained.
Consequently my supervisor and I decided that I would rewrite everything and optimize it from the beginning. This took longer than expected and while the resulting parallelism and the logic behind it were clear in the code, code complexity was an issue. The code consisted of three helper functions and two kernel functions in two files and when I went and integrated support for multiple sites and potential models, it became clear that the current design was not clear enough to endure further feature additions and lacked the flexibility for quick changes.
Treating the old implementation as a prototype and knowing about many possible traps and fallacies, I set out and rewrote everything again. This time the focus was on modularity and separation of concerns instead of performance. Code architecture was the most important thing this time around. I scaffolded the new version around the old code, which was working correctly and already optimized. I embedded it into the new design step by step. Afterwards I optimized the code.
Sadly some of the optimizations didn't improve the performance noticeably. I think the monolithic, all-in-one approach I've used is the main issue.
My code uses the driver API, because it would be easy to dump all the CUDA calls and replay them later for debugging purposes.
If I could iterate over the code one more time, I'd try to use many small kernels instead of only two rather large ones. Currently the kernels sequentially calculate all potential models for the molecule pair that are needed (for each site). This makes it difficult to measure performance improvements and isolate bottlenecks. And I suspect it's slower than it has to be.
I'd also create a sandbox application to test and develop the kernels instead of using MarDyn as starting point, because it is unnecessarily huge and it is not really needed to verify that the CUDA code works (or not).
You can find the final paper that contains a description of my code and the performance results here, and my code here (I cannot include MarDyn's code because its source is closed, but at least I can share my code).
I just want to share two more links: