No, you want the largest block size whose working set still fits in the target cache (L1, L2, etc.)
shhhh
You want as big a BLOCKSIZE as possible because you want to grab as much memory at once as possible to generate as much work as possible to be run in parallel.
kayvonf
@shhhh. Let's clarify things. What do you mean by "grab as much memory" (that sounds like an allocation), and what do you mean by "generate as much work as possible"?
Compared with the previous slide, the total amount of computation is the same, but the amount of communication (here, data loaded from memory) is much less: computing on BLOCK_SIZE x BLOCK_SIZE sub-matrices reuses data while it is still in cache. A BLOCK_SIZE chosen so that the sub-matrices fit in cache (L1, L2, or L3) achieves the best performance because it maximizes arithmetic intensity.
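To make the idea concrete, here is a minimal sketch of cache-blocked matrix multiplication (C += A*B). The names `N`, `BLOCK_SIZE`, and `matmul_blocked` are illustrative, and it assumes BLOCK_SIZE divides N evenly and that C starts zeroed; the point is just that the three inner loops touch only a working set of about 3 * BLOCK_SIZE^2 elements at a time.

```c
#define N 64          /* matrix dimension (assumed for illustration) */
#define BLOCK_SIZE 16 /* tile edge; pick so the tiles fit in the target cache */

/* C += A * B, computed tile by tile so each BLOCK_SIZE x BLOCK_SIZE
   sub-matrix is reused from cache instead of being re-fetched from memory. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE)
        for (int jj = 0; jj < N; jj += BLOCK_SIZE)
            for (int kk = 0; kk < N; kk += BLOCK_SIZE)
                /* multiply one pair of tiles; working set is roughly
                   3 * BLOCK_SIZE * BLOCK_SIZE doubles */
                for (int i = ii; i < ii + BLOCK_SIZE; i++)
                    for (int k = kk; k < kk + BLOCK_SIZE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BLOCK_SIZE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

The loop order and structure are one common choice, not the only one; the "largest block that still fits" advice above is about picking BLOCK_SIZE, not this particular loop nest.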
Perhaps a nice way to describe things is in terms of arithmetic intensity.
If the block size is n * n, each loaded element is reused O(n) times.
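The O(n) reuse claim can be checked with a back-of-the-envelope count: multiplying one pair of b x b tiles costs about 2*b^3 flops against roughly 3*b^2 words of memory traffic (one tile each of A, B, and C), so intensity grows linearly in b. A small sketch (the function name and the 3*b^2 traffic model are assumptions of this estimate):

```c
/* Estimated arithmetic intensity (flops per word moved) for multiplying
   one pair of b x b tiles: 2*b^3 multiply-add flops over about 3*b^2
   words of traffic (one b x b tile each of A, B, and C). */
double tile_arithmetic_intensity(int b) {
    double flops = 2.0 * b * b * b;
    double words = 3.0 * b * b;
    return flops / words; /* = 2b/3, i.e. O(b) reuse per word loaded */
}
```

Doubling b doubles the intensity, which is exactly why you want the largest block whose 3*b^2-word working set still fits in the cache you are targeting.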