Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

cwchang

The time spent on data loads is smaller in the graph on the right. I wonder if this is because we can make better use of the cache when working with a tile-based layout?

muchanon

I also believe this is due to cache benefits. It is like cachelab in 15213 where you needed to zig-zag in order to complete the larger matrices.
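For anyone who hasn't done that lab, here is a minimal sketch of the kind of blocked traversal I mean. It is not the lab's actual solution: `blocked_transpose` and the `tile` size are names and values made up for illustration. In Python it only shows the loop order; in C, visiting one small tile at a time is what keeps the rows being read and written inside the cache.

```python
def blocked_transpose(a, tile=4):
    """Transpose a square list-of-lists matrix one tile x tile block at a time."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile.
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out

# Small 6 x 6 example (size chosen arbitrarily; tile need not divide n).
a = [[r * 6 + c for c in range(6)] for r in range(6)]
t = blocked_transpose(a, tile=4)
```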

What causes the synchronization time to decrease? It seems like it has to be related to the traversal, so is it also the cache?

locked

If the computations within a block share most of their data, then dividing the work this way also reduces time because there are more cache hits.

boba

There is slightly less variance in the busy time in the second graph, which might explain the decrease in synchronization time. I'm not sure why there is less variance in busy time, though.

ask

The 4D layout spends less time in i) Data Loads and ii) Synchronization.

i) Data Loads: 4D takes less time to load the working set because fewer cache lines need to be loaded to bring the entire working set into the cache (fewer cache misses). This is because block-major ordering has better spatial locality than row-major ordering for this working set.

ii) Synchronization: In the 2D layout, more cache lines are shared between processors than in the 4D layout. This increases synchronization time because the same cache line may be present in multiple processors' caches, so a write by one processor generates coherence traffic to the others.
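To make both points concrete, here is a rough sketch that counts cache lines under each layout. The sizes (`N`, `B`, `LINE`) and the helper names are made up for illustration, and this is an idealized model that ignores cache capacity, associativity, and the real machine from the slide. It maps each element to an offset under row-major (2D) and block-major (4D) ordering, then counts how many lines each processor's block touches and how many lines are touched by more than one processor.

```python
from collections import defaultdict

N = 12     # grid is N x N elements (hypothetical size)
B = 6      # each processor owns one B x B block (so 2 x 2 processors)
LINE = 4   # elements per cache line (hypothetical)

def addr_2d(i, j):
    """Row-major (2D array) element offset."""
    return i * N + j

def addr_4d(i, j):
    """Block-major (4D array) offset: each block is contiguous in memory."""
    blocks_per_row = N // B
    block_id = (i // B) * blocks_per_row + (j // B)
    return block_id * B * B + (i % B) * B + (j % B)

def line_stats(addr):
    """Return ({processor: lines touched}, number of lines touched by >1 processor)."""
    owners = defaultdict(set)  # cache line index -> processors touching it
    for i in range(N):
        for j in range(N):
            proc = (i // B, j // B)
            owners[addr(i, j) // LINE].add(proc)
    per_proc = defaultdict(int)
    for procs in owners.values():
        for p in procs:
            per_proc[p] += 1
    shared = sum(1 for procs in owners.values() if len(procs) > 1)
    return dict(per_proc), shared

lines_2d, shared_2d = line_stats(addr_2d)
lines_4d, shared_4d = line_stats(addr_4d)
print("2D:", lines_2d[(0, 0)], "lines for processor (0,0);", shared_2d, "shared lines")
print("4D:", lines_4d[(0, 0)], "lines for processor (0,0);", shared_4d, "shared lines")
```

With these particular sizes, processor (0,0) touches 12 lines in the 2D layout versus 9 in the 4D layout, and 12 lines straddle two processors in 2D versus none in 4D. The gap comes from block rows that do not start and end on cache line boundaries; if every block row happened to be line-aligned, this simple model would show no difference.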

lfragago

One important thing to notice is that, as expected, the work done by both schemes is the same (compare the Busy bars). The performance difference is determined entirely by memory accesses!

fxffx

From the diagram, we can see that, compared to the 2D blocked layout, the 4D blocked layout spends significantly less time on data loads. The synchronization time decreases too.