Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Parallel Programming Case Studies

Previous | Next --- Slide 36 of 59

Back to Lecture Thumbnails

oulgen

Can someone explain this blocked version of scan?

ericwang

In my understanding, this method is based on the assumption that a SIMD scan on 32 elements is very fast (because of 32-wide SIMD execution). So every 32 elements would be a basic warp unit.

For each "SIMD scan warp", the scan result is an array of the 32 numbers (the total summary is at the end of the scan result array). Once we got 32 summaries by doing 32 "SIMD scan warp", we can do an extra "SIMD scan warp" on the list of the summaries. So that each summary in the array will be added by all previous summaries in this list. Then we add this processed summaries back to each unit warp. And all numbers in the warp units are added by the previous numbers' summary. That's the work flow for a block.

If we have more then one block, just do the same "SIMD scan warp" based on the blocks' summaries. Like recursive call. Just remember we need to add the scan result back to each block.

Elias

Adding on to that: the basic idea is that if you scan consecutive blocks of an array, you can recursively scan your partial sums, and then map an add of those partial sums over the blocks which generated them. For example, say I had an array:

A = [1,2,3,4,5,6,7,8].

If I scanned in blocks of size 3, I'd generate partial scans:

[0,1,3] => 6 [0,4,9] => 15 [0,7,15] => 15

If we scan those partial sums, we'd get: [6,15,15] => [0, 6, 21].

So let's go back to the original (block scanned) array: a A' = [0,1,3,0,4,9,0,7]

If we add in our scanned partial sums to their blocks, we get:

A'' = [0, 1, 3, 6, 10, 15, 21, 28]

Compare to the original array:

A = [1,2,3,4,5,6,7,8]

Scanned!

You can see how this recursively let's us reduce an arbitrarily large scan to SIMD-wide scans (or warp level scans, on a GPU).

admintio42

So a larger scan is essentially breaking the large scan into lots of smaller ones and then scanning the result.