Intel's Acceler8 competition was about solving the following problem as fast as possible using parallelization:

Given a 2-dimensional array of natural integers (between -32000 and +32000), the Maximum Subarray Problem asks for the rectangular area within the array that maximizes the sum of the array elements found in the area.

I worked on it together with Armin Krupp and we made 25th place worldwide. You can find the ranking here (our team name is AK^2 - fitting, isn't it? :)).

We have used Intel's Threading Building Block library and it was a nice opportunity to examine it as we haven't worked with it before.

It offers a better abstraction than OpenMP or pthreads. Moreover, its documentation is very good and you quickly learn how to work with the library. We didn't have to use a low-level synchronization construct once during the development and everything worked fine without any race conditions. The **parallel_*** functions (eg **parallel_for**, **parallel_reduce**, and **parallel_scan**) together with icc's C++0x support (lambda functions) allowed for very concise code and little code overhead.

Our implementation builds upon Kadane's algorithm for the two dimensional case using prefix sums. One basic implementation that gets across the basic idea can be found here. Our optimized implementation is similar: we have simply parallelized as much as possible.

As you can see the outer two loops iterate over a two-dimensional range that is pretty much an upper right triangle of the whole possible domain. For this, we've implemented a custom **range** in TBB that allows for better load balancing. In TBB, a **range** refers to an iteration range of any kind and supports a **split** operation that is used internally by the task scheduler to distribute it dynamically onto multiple threads as it sees fit.

Last but not least we came up with a way to parallelize the 1D part of Kadane's algorithm by splitting the column range into linear subranges and merging the subsolutions into one solution, ie a classical divide and conquer approach.

Because it's the most abstract yet interesting part of our implementation, I'm going to go into more detail here:

How can you find the maximum subarray of a 1D array when you know the maximum subarray of the two "halves" (I call them halves even though they don't have to be split evenly)?

Well, you don't: you need more information. We have to calculate the following information for each chunk:

maximum subarray that starts at the beginning of the chunk

maximum subarray that ends at the end of the chunk

total sum

maximum subarray inside the chunk

It's straightforward to merge these values for two neighboring chunks:

The merged maximum subarray that starts at the beginning of the merged chunk is either said value for the left chunk or the total sum of the left chunk + that value of the right chunk. The merged maximum subarray that ends at the end of the chunk is calculated likewise.

The maximum subarray is just the biggest of all merged values or the left maximum subarray or right one.

Using this idea, you can use a simple **parallel_reduce** to parallelize Kadane's algorithm.

Of course, there is some overhead but for the right problem sizes this is faster than any sequential algorithm.

You can find our code here.