You want as big a BLOCKSIZE as you can get without exceeding cache constraints. (A conservative estimate is that 3*BLOCKSIZE$^2$ needs to be less than the cache size, so that you can fit the working blocks for all 3 matrices at once.) A big blocksize is better because you do O(BLOCKSIZE$^3$) math but only O(BLOCKSIZE$^2$) loads per block, so higher blocksize means higher arithmetic intensity.
zvonryan
In this situation, a block in A or B might be read from memory many times, right? Since many blocks in C would require the same block in A or B.
kayvonf
@zvonryan. That is correct (an unavoidable given the nature of the algorithm). However, the hope is that if the BLOCKSIZE was sufficiently large, then the arithmetic intensity of the program could be pushed sufficiently high that they would not notably impact performance. For example, on a multi-threaded system the loads of A and B blocks could be overlapped with computation.
hofstee
Would it be better or worse to have C in cache? If you did streaming non-allocate writes using SIMD then I think you could write out an entire cache line at once with AVX-512, so I think we could get away with 2*BLOCKSIZE$^2$.
stride16
Is this essentially the same concept we use in 213's cache lab, in transposing matrix? Basically, choosing a big blocksize to ensure higher mathematical intensity, while ensuring that the block properly fits into the cache.
You want as big a BLOCKSIZE as you can get without exceeding cache constraints. (A conservative estimate is that 3*BLOCKSIZE$^2$ needs to be less than the cache size, so that you can fit the working blocks for all 3 matrices at once.) A big blocksize is better because you do O(BLOCKSIZE$^3$) math but only O(BLOCKSIZE$^2$) loads per block, so higher blocksize means higher arithmetic intensity.
In this situation, a block in A or B might be read from memory many times, right? Since many blocks in C would require the same block in A or B.
@zvonryan. That is correct (an unavoidable given the nature of the algorithm). However, the hope is that if the
BLOCKSIZE
was sufficiently large, then the arithmetic intensity of the program could be pushed sufficiently high that they would not notably impact performance. For example, on a multi-threaded system the loads of A and B blocks could be overlapped with computation.Would it be better or worse to have C in cache? If you did streaming non-allocate writes using SIMD then I think you could write out an entire cache line at once with AVX-512, so I think we could get away with 2*BLOCKSIZE$^2$.
Is this essentially the same concept we use in 213's cache lab, in transposing matrix? Basically, choosing a big blocksize to ensure higher mathematical intensity, while ensuring that the block properly fits into the cache.