No, you want the largest block size whose working set still fits in the target cache (L1, L2, etc.)
shhhh
You want as big a BLOCKSIZE as possible because you want to grab as much memory at once as possible to generate as much work as possible to be run in parallel.
kayvonf
@shhhh. Let's clarify things. What do you mean by "grab as much memory" (that sounds like an allocation), and what do you mean by "generate as much work as possible"?
Compared with the previous slide, the total amount of computation is the same, but the amount of communication (here, data loaded from memory) is much less: computing on BLOCK_SIZE x BLOCK_SIZE sub-matrices reuses data while it is still in cache. A BLOCK_SIZE chosen so that the sub-matrices fit in cache (L1, L2, or L3) achieves the best performance because it maximizes arithmetic intensity.
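To make the idea concrete, here is a minimal sketch of cache-blocked matrix multiplication (C += A*B). The names `N`, `BLOCK_SIZE`, and `matmul_blocked` are illustrative, and it assumes BLOCK_SIZE divides N evenly and that C starts zeroed; the point is just that the three inner loops touch only a working set of about 3 * BLOCK_SIZE^2 elements at a time.

```c
#define N 64          /* matrix dimension (assumed for illustration) */
#define BLOCK_SIZE 16 /* tile edge; pick so the tiles fit in the target cache */

/* C += A * B, computed tile by tile so each BLOCK_SIZE x BLOCK_SIZE
   sub-matrix is reused from cache instead of being re-fetched from memory. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE)
        for (int jj = 0; jj < N; jj += BLOCK_SIZE)
            for (int kk = 0; kk < N; kk += BLOCK_SIZE)
                /* multiply one pair of tiles; working set is roughly
                   3 * BLOCK_SIZE * BLOCK_SIZE doubles */
                for (int i = ii; i < ii + BLOCK_SIZE; i++)
                    for (int k = kk; k < kk + BLOCK_SIZE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BLOCK_SIZE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

The loop order and structure are one common choice, not the only one; the "largest block that still fits" advice above is about picking BLOCK_SIZE, not this particular loop nest.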
Perhaps a nice way to describe things is in terms of arithmetic intensity.
If the block size is n * n, each loaded element is reused O(n) times.
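The O(n) reuse claim can be checked with a back-of-the-envelope count: multiplying one pair of b x b tiles costs about 2*b^3 flops against roughly 3*b^2 words of memory traffic (one tile each of A, B, and C), so intensity grows linearly in b. A small sketch (the function name and the 3*b^2 traffic model are assumptions of this estimate):

```c
/* Estimated arithmetic intensity (flops per word moved) for multiplying
   one pair of b x b tiles: 2*b^3 multiply-add flops over about 3*b^2
   words of traffic (one b x b tile each of A, B, and C). */
double tile_arithmetic_intensity(int b) {
    double flops = 2.0 * b * b * b;
    double words = 3.0 * b * b;
    return flops / words; /* = 2b/3, i.e. O(b) reuse per word loaded */
}
```

Doubling b doubles the intensity, which is exactly why you want the largest block whose 3*b^2-word working set still fits in the cache you are targeting.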