## Accelerate Your Code, Early 2012

Daniel H. and I took part in Intel's follow up competition to Acceler8: Acclerate Your Code, Early 2012. Again the goal was to parallelize an algorithm to make it run as fast as possible. The task that needed to be parallelized can be summed up as follows:

    Compare 1 reference DNA sequence (in a file) with a variable number of DNA sequences (from 1 or more files). A DNA sequences is composed of 4 characters : A,T,C,G. Find all substrings of >= N characters that match exactly.

We made 7th place out of over 500 participating teams worldwide.

## Our solution

I’m going to write about our submission and what we did to parallelize and optimize Intel’s reference implementation, which I've uploaded here.

Upfront, let me state that in the end we did not use inline SSE or other low-level optimizations. We only used Intel’s Amplifier and Inspector to profile and debug our code, and the former only very late in the process, adhering to the formula “Write first, optimize later where needed”.

Candreolli’s article gives a detailed overview of the problem, so I’m not going to go into more details here. If you really want to understand everything, I recommend looking at the Wikipedia article and, most importantly, the reference source code, which contains some specialized “shifting” conditions for ranges of duplicated characters. These conditions caused us some headaches and greatly influenced our approach.

## Divide and conquer

The idea we followed for parallelizing the algorithm is a simple divide-and-conquer scheme: We assume that the input sequence is longer than the reference sequence and split it into multiple parts which are compared in parallel. The partial results are then merged to yield the final results.

The merging is non-trivial. It is possible that a match is split on the border between two parts of the input sequence, so we have two matches that have to be fused together correctly. Moreover, it is possible that either of the matches isn’t actually _minMatchLength _characters long (minMatchLength = N from the problem description). We solve this by adding redundancy: all the parts overlap slightly, so that every border match of the minimum length is included correctly in one of the parts.

Last but not least, it is important to notice the symmetry of the problem and realize that you can switch input and reference sequence without changing the results. This makes it possible to enforce our assumption that the input sequence is longer than the reference sequence and ensures that the longer sequence is always split and distributed on different threads.

We tried to implement a solution from zero and repeatedly failed, because of the weird shifting behavior of the reference implementation. There are some threads on Intel's forum about it, and without stating the exact code it is difficult to explain the behavior without omissions, so please look at the reference implementation to see the whole mess :-)

In the end we decided to stick to the reference implementation and use very simple changes that can be verified and tested easily to ensure that our code does not deviate from the reference behavior. We wrote extensive tests using the unit test framework GoogleTest and created lots of sample data for special cases to test our code against.

## Reference sequence hash

Using a hash map was suggested by Intel in several places, so this was one of the first things we looked at after implementing the basic parallelization described above. Our basic idea for hashing is the following: if you have a minMatchLength of eg 32 or 33 characters, you can split the reference sequence into blocks of 16 characters and you know that any match with the input sequence must contain at least one entire block somewhere. This follows from the pigeon-hole principle.

Now what we do is, we hash every block of _blockSize := floor( minMatchLength / 2 ) length _in the reference sequence and store its position. We partition the input sequence into blocks and loop through it block by block, ie with blockSize as step size. We look-up each block in the hashtable and use an array to chain matched blocks together where possible. If such a chain cannot be continued, we expand the chain character by character on the left and right and check if its length is at least minMatchLength characters. If so, we have a valid match.

The hash function is very simple: we transcode A, C, T, G to 0, 1, 2, 3 on reading the file (using mmap), and simply use 2 bits per character to compress a block. We also directly use this form for computing a block's hash value, but we start with the characters at the end of the block first. This way the hash value already allows us to compares the end of block which increases the chance that strcmp fails early on for hash collisions. We have tried other general hashing schemes but they weren’t much faster, so we stuck to it :-)

The transcoding from A, C, T, G to 0, 1, 2, 3 was the only place where we used the profiler to optimize the code: we used several ifs at first but switched to a lookup table to reduce branch mispredictions.

## State machine for result shifting

The special “shifting” conditions mentioned before caused some headaches. Eventually we applied KISS and simply rebuild the data structures (the L array) from the reference implementation during each iteration and used the original code from the reference implementation to shift the results when necessary.

This was very slow, of course, but it was the right ansatz. With this as starting point, we looked at our chained block array and changed it to a sparse array (ie an STL vector containing (endIndex, matchLength) pairs). This gave us a nice speed-up. Moveover, we looked at the L array and saw that the shifting logic works on diagonals in the array and only uses the neighboring diagonals to decide whether to ignore a valid match or not. It was straightforward to replace rebuilding the L array with a simple “state machine” that only needs to know the length of diagonals and how many characters it has been expanded to the right. Each valid match or continued chain is pushed into this state machine sequentially and the simple algorithm inside does the rest.

## Final optimizations

Instead of using an STL hash map, we wrote our own flat hash map. We know the number of buckets and the number of entries up-front, so we can create two arrays: one containing the buckets, and one containing the start index of each bucket and we fill them using a fast counting sort (which is O(N)). We also sort each bucket lexicographically, so we can use a binary search during look-up. This resulted in another significant speed-up.

The code that merges results from different parts of the sequence used a multimap at first. We changed this to several std::vectors, and got another small speed-up.

The current code contains an optimization that only occurred to me after the deadline: to correctly handle sequences containing duplicated characters, eg AAAA…, I stored all ranges of duplicated characters in an array while loading the file. During parallelization the code ensures that no such range is split, because it would confuse the shifting logic. However, this is rarely used and the memory allocations for the array incurred a noticeable overhead. I removed all that code and changed it to simple move the part boundaries right if the characters left and right of the border of a part are equal.

The guys from Intel were so kind to benchmark this version, too, and it was 20% faster. So that is something we lost out on in the submitted version.

## Code

You can find our code here.

## Acknowledgement

We want to thank our university for giving us access to a 16 core workstation, which we used for testing.

## Performance

I want to share 3 benchmarking results.
This runs the default test (237MB) in test_input3 on a 16 core machine:

/usr/bin/time -v ../../solution-windowhash/run 16 16 test_input_3.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.21.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.22.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.X.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.Y.fa > myoutput.txt

real: 1.15 user: 4.61 sys: 0.45 Percent of CPU this job got: 437%

It seemingly doesn't scale very well. If we put in all DNA data (~2.7GB), it starts looking better:

/usr/bin/time -v ../../solution-windowhash/run 16 16 test_input_3.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* > myoutput.txt

real: 4.44 user: 53.54 sys: 2.98 CPU: 1271%

And it scales: with 6x input (~16.2GB), we get:

/usr/bin/time -v ../../solution-windowhash/run 16 16 test_input_3.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.<br></br>dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* > myoutput.txt

real: 23.61 user: 328.39 sys: 14.38 CPU: 1451%

Ie only with 16.2 GB we start to saturate the cores and if we compare the data size to the real time values, we get a throughput of:

 Input size (in MiB) 239 2799 16794 Throughput (in MiB/s) 208 630 711

We see strong increase from 239MiB to 2.8 GiB, and it even gets better at ~17 GiB of input data. All in all, throughput is peaking at around 700 MiB/s.