Iamme

I'm a little confused as to why the 4D blocking reduces time spent on communication between tasks. Doesn't every block need to access the same elements regardless of which order it accesses them in? Or is order key? Also how does this relate to cache blocks? Thanks!

tommywow

In terms of cache locality, the 4D array layout avoids a lot of misses. Each processor works on one block of the grid, and in the 4D layout that whole block is stored contiguously, so every cache line the processor fetches holds only its own block's data. In the 2D row-major layout, a processor's block is made of partial rows: when a cache line is fetched, say from the first row, it can straddle the boundary between neighboring processors' blocks, so part of what gets brought in belongs to other processors and is never used. The 4D layout doesn't have this problem.
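If it helps to see the index math, here is a minimal sketch (mine, not from the slide; the grid size N, block size B, and the names idx_2d / idx_4d are made up for illustration). The point is that in the 2D layout a processor's block is B partial rows sitting N elements apart in memory, while in the 4D layout the whole block occupies one contiguous run of B*B elements:

```c
#include <stdio.h>
#include <stddef.h>

#define N 1024   /* grid dimension (assumed)  */
#define B 256    /* block dimension (assumed) */

/* 2D row-major layout: element (r, c) of the whole grid. A processor's
 * block consists of B partial rows, each N elements apart in memory. */
static size_t idx_2d(size_t r, size_t c) {
    return r * N + c;
}

/* 4D "block-major" layout: blocks are stored one after another, and the
 * elements inside a block are contiguous in row-major order.
 * (br, bc) select the block; (r, c) are offsets inside that block. */
static size_t idx_4d(size_t br, size_t bc, size_t r, size_t c) {
    size_t blocks_per_row = N / B;
    size_t block_id = br * blocks_per_row + bc;   /* which block          */
    return block_id * (size_t)(B * B)             /* skip earlier blocks  */
         + r * B + c;                             /* offset inside block  */
}

int main(void) {
    /* Element (1, 0) of the top-left block: in the 2D layout it sits a
     * full grid row (N elements) past element (0, 0); in the 4D layout
     * it is only B elements past it. */
    printf("2D offset: %zu   4D offset: %zu\n",
           idx_2d(1, 0), idx_4d(0, 0, 1, 0));
    return 0;
}
```

So with idx_4d, consecutive rows of a processor's block sit right next to each other in memory, and the cache lines it pulls in contain only its own data.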

Iamme

By blocks do you mean cache blocks? Where are you imagining the separations between cache blocks?

Maybe I'm misremembering things from 213, but I thought a row of a matrix is normally the most likely thing to share a cache block, because it's a contiguous chunk of memory. So the different rows belonging to a processor in the 4D arrangement probably won't be in the same cache block.

jmackama

@Iamme contiguity only works in your favor up to the cache block (line) size; past that point, the set indexing and associativity matter more. Using a 4D arrangement lets a processor's block map to more distinct cache sets, if done right.
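To make the set-index point concrete, here's a small sketch (my own, with made-up parameters: a cache with 64 sets and 64-byte lines, float elements, N = 1024, B = 256). It prints which cache set the start of each row of a processor's block maps to under the two layouts; the row stride is N elements in the 2D layout but only B elements in the 4D layout:

```c
#include <stdio.h>

#define LINE 64   /* cache line size in bytes (assumed)             */
#define SETS 64   /* e.g. a 32 KiB, 8-way, 64 B-line cache (assumed) */
#define ELEM sizeof(float)

/* Cache set that a byte address maps to. */
static size_t set_of(size_t byte_addr) {
    return (byte_addr / LINE) % SETS;
}

int main(void) {
    size_t N = 1024, B = 256;

    for (size_t r = 0; r < 8; r++) {
        size_t addr_2d = r * N * ELEM;   /* row stride N in the 2D layout */
        size_t addr_4d = r * B * ELEM;   /* row stride B in the 4D layout */
        printf("row %zu: 2D -> set %zu   4D -> set %zu\n",
               r, set_of(addr_2d), set_of(addr_4d));
    }
    return 0;
}
```

With these (assumed) numbers the 2D row stride is 4 KiB, exactly one way of the cache, so every row of the block starts in set 0 and the whole block only ever touches sets 0 through 15, piling up conflict misses. In the 4D layout the block is contiguous, so its rows start in sets 0, 16, 32, 48, ... and its lines spread evenly over all the sets.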

hanzhoul

The 4D memory layout can also help reduce overhead between threads: if every thread's memory accesses are fast and predictable, there's less chance that two threads end up with very different memory-loading times and have to wait on each other (e.g., at a barrier).