Previous | Next --- Slide 28 of 42
Back to Lecture Thumbnails
bstan

Question: if we were already given the layout on the left, wouldn't we have to recreate the data in the layout on the right to minimize the artifactual communication? For operations involving 2D arrays, it doesn't seem likely that the data would be organized like the example on the right to begin with.

kayvonf

Correct. This slide suggests that the data be reorganized in the manner shown at right in order to reduce artifactual communication during parallel computation.

gif

It is actually very common in performance critical applications to use this kind of array layout. There are generalizations to this 'block major' order such as Morton order. It makes some interesting reading.

tianyih

I'm kind of confused about what this 'memory layout' refers to. So is it the data structure we defined to feed our programs? According to this slide, we should define data blocks corresponding to work assignments of processors?

kayvonf

Memory layout refers to how data in an application is mapped to the address space. For example, in a 2D C-style array, data is mapped to the addresses in row major order. A[i][j] is stored at address A + i*WIDTH + j; However the developer might choose to map elements of a 2D array to addresses in a difference way.

Challenge: Can someone write some code to compute the address of element A(i,j) where (i=row and j=column) in the "4D blocked layout", if the size of a block is BLOCK_X * BLOCK_Y elements and the dimension of the overall 2D grid is GRID_WIDTH and GRID_HEIGHT? For simplicity, assume BLOCK_X evenly divides GRID_WIDTH.

eatnow

I'd like to check if this is correct:

int i1 = i/BLOCK_Y;
int j1 = j/BLOCK_X;
int block_no = i1 * (GRID_WIDTH/BLOCK_X) + j1; 
// Address of start of block
int block_start = block_no * BLOCK_X * BLOCK_Y;
int i2 = i % BLOCK_Y;
int j2 = j % BLOCK_X;
return block_start + i2 * BLOCK_X + j2;
Q_Q

An extreme example where blocked 4D array layout helps is when the working set doesn't fit in physical ram and some parts of it are paged out - if there were only a few threads working on many blocks, the row-major array layout would cause lots of pages to be paged into DRAM for each block, whereas the blocked array layout would only cause one or two pages to be paged in.

nrchu

@eatnow, don't you have X and Y mixed up in your code? I think that you're finding A(j,i) instead.