Slide View : 15-418/618 Spring 2014

Performance Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 28 of 42

Back to Lecture Thumbnails

bstan

Question: if we were already given the layout on the left, wouldn't we have to recreate the data in the layout on the right to minimize the artifactual communication? For operations involving 2D arrays, it doesn't seem likely that the data would be organized like the example on the right to begin with.

This comment was marked helpful 0 times.

kayvonf

Correct. This slide suggests that the data be reorganized in the manner shown at right in order to reduce artifactual communication during parallel computation.

This comment was marked helpful 0 times.

gif

It is actually very common in performance critical applications to use this kind of array layout. There are generalizations to this 'block major' order such as Morton order. It makes some interesting reading.

This comment was marked helpful 0 times.

tianyih

I'm kind of confused about what this 'memory layout' refers to. So is it the data structure we defined to feed our programs? According to this slide, we should define data blocks corresponding to work assignments of processors?

This comment was marked helpful 0 times.

kayvonf

Memory layout refers to how data in an application is mapped to the address space. For example, in a 2D C-style array, data is mapped to the addresses in row major order. A[i][j] is stored at address A + i*WIDTH + j; However the developer might choose to map elements of a 2D array to addresses in a difference way.

Challenge: Can someone write some code to compute the address of element A(i,j) where (i=row and j=column) in the "4D blocked layout", if the size of a block is BLOCK_X * BLOCK_Y elements and the dimension of the overall 2D grid is GRID_WIDTH and GRID_HEIGHT? For simplicity, assume BLOCK_X evenly divides GRID_WIDTH.

This comment was marked helpful 0 times.

eatnow

I'd like to check if this is correct:

int i1 = i/BLOCK_Y;
int j1 = j/BLOCK_X;
int block_no = i1 * (GRID_WIDTH/BLOCK_X) + j1; 
// Address of start of block
int block_start = block_no * BLOCK_X * BLOCK_Y;
int i2 = i % BLOCK_Y;
int j2 = j % BLOCK_X;
return block_start + i2 * BLOCK_X + j2;

This comment was marked helpful 0 times.

Q_Q

An extreme example where blocked 4D array layout helps is when the working set doesn't fit in physical ram and some parts of it are paged out - if there were only a few threads working on many blocks, the row-major array layout would cause lots of pages to be paged into DRAM for each block, whereas the blocked array layout would only cause one or two pages to be paged in.

This comment was marked helpful 0 times.

nrchu

@eatnow, don't you have X and Y mixed up in your code? I think that you're finding A(j,i) instead.

This comment was marked helpful 0 times.