A month has passed with Intel’s Acceler8 competition and it has finally come to an end. It’s been a long way from implementing the first sequential algorithm to having a fully-fledged parallel version.

I have never worked with Intel’s Threading Building Block library before and it was a nice opportunity to examine it, since it offered a better abstraction than OpenMP or pthreads.

The documentation is very good and you quickly learn how to work with the library. One of the caveats is that I didn’t have to use a low-level synchronization construct once in the development and everything worked fine without any race conditions or similar . The parallel_* functions (eg parallel_for, parallel_reduce, and parallel_scan) together with icc’s C++0x support (lambda functions) allowed for very concise code and little programming overhead.

The implementation builds on Kadane’s algorithm for the two dimensional case using prefix sums. One implementation that gets across the basic idea can be found here. Mine is similar and I simply parallelized as much as possible.

As you can see the outer two loops iterate over a two-dimensional range that is pretty much an upper right triangle of the whole possible domain. For this I’ve implemented a custom range that allows for better load balancing. A range in TBB defines an iteration range of any kind and supports a split operation that is used internally by the task scheduler to distribute the range dynamically on multiple threads as it sees fit.

Last but not least I came up with a way to parallelize the 1D part of Kadane’s algorithm that is being used by splitting the column range into linear subranges and merging the subsolutions into one solution, ie a classical divide and conquer approach.

Because it’s the most abstract yet interesting part of our implementation, I’m going to go into more detail here. :-)

How can you find the maximum subarray of a 1D array, if you know the maximum subarray of the two “halves” (they don’t have to be split evenly)? Well, you don’t, you need more information.

We calculate the following information for each chunk:

  • maximum subarray that starts at the beginning of the chunk
  • maximum subarray that ends at the end of the chunk
  • total sum
  • maximum subarray

It’s easy to figure out how to merge these values for two neighboring chunks into the values of the merged chunk. The merged maximum subarray that starts at the beginning of the merged chunk is either said value for the left chunk or the total sum of the left chunk + that value of the right chunk. You can figure out how it works for the maximum subarray that ends at the end of the merged chunk :-) The maximum subarray is just the biggest of all merged values or the left maximum subarray or right one.

Using this idea you can use a simple parallel_reduce to parallelize Kadane’s algorithm.

Of course, there is some overhead but for the right problem sizes this will be faster than the sequential algorithm as always.

Two more take-aways:

  • Always try to use language features like templates or lambda expressions to reduce duplicate code or make the code more concise.
  • Write unit tests. I have used googletest which is a very small but very capable library, and it has spared me a lot of debugging trouble.

Cheers :-)