Previous | Next --- Slide 4 of 43
Back to Lecture Thumbnails
cwchang

The working set of left version is of size Wx(H+2), while the working set of the right version is just Wx(CHUNKS_SIZE+2). Thus, for the left version, the working set can not fit in cache, while for the right version, we can always adjust the CHUNK_SIZE to make it fit in cache. That is the reason we choose the right version implementation.

paracon

The program on the right, reorders computation to improve producer-consumer locality.

hweetvpu

A note for the diagram in program 2. The first yellow block is of size (W+2)x(CHUNK_SIZE+2), which is then reduced horizontally to a tmp_buf of size Wx(CHUNK_SIZE+2), and then vertically to the yellow block of size WxCHUNK_SIZE. The fact that we use a "larger" for loop in program 2 makes locality much better than in program 1.

machine6

In a nutshell, program 2 performs better than program 1 due to the reordering of program structure such that we utilize cache locality to the maximum.

pdp

The central theme between previous and this slide is to re-use previously computed things as much as possible and reduce memory I/O.

srb

The general idea of producer-consumer locality is that something produced by one kernel being consumed immediately by the next kernel, so that data stays local instead of being committed to global memory, is optimal.