In the first phase (horizontal), we blur pixels in the x direction: we read values from input and write to temp in row-major order. Every cache line we load is fully used before we move on, so spatial locality is good. In the next phase (vertical), we read elements down a column, so each read for a single output pixel touches a different cache line. The next iteration of the i loop reads elements from the same rows as the previous iteration, usually from cache lines that are already loaded, so the working set at any moment is 3 cache lines, one per row. Moving down to the next output row reuses two of those three rows, so we get perfect cache utilization if the cache can fit three rows of the image. In general, the cache behavior in this situation is good. The remaining inefficiency is that we write temp out to memory and read it back in, which doubles the memory bandwidth needed.
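The access pattern described above can be sketched in C as follows. This is only an illustration under assumptions that the text does not state: a 3-tap box kernel, float pixels, no border handling, and the hypothetical names blur_two_pass, w, h, and out. The original code may differ in all of these. The names input/in, temp, and the i loop follow the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-pass 3x3 box blur (assumed kernel and names).
 * in:   w x h image, row major
 * temp: (w-2) x h scratch buffer, row major
 * out:  (w-2) x (h-2) result, row major */
void blur_two_pass(const float *in, float *temp, float *out,
                   size_t w, size_t h)
{
    assert(w >= 3 && h >= 3);
    size_t tw = w - 2; /* width of temp and out */

    /* Horizontal phase: all three reads come from the same row of in,
     * and temp is written sequentially, so accesses are unit stride. */
    for (size_t j = 0; j < h; j++)
        for (size_t i = 0; i < tw; i++)
            temp[j * tw + i] =
                (in[j * w + i] + in[j * w + i + 1] + in[j * w + i + 2]) / 3.0f;

    /* Vertical phase: the three reads come from three different rows of
     * temp, so they land on three different cache lines. Advancing i stays
     * within those same three rows; advancing j reuses rows j+1 and j+2. */
    for (size_t j = 0; j + 2 < h; j++)
        for (size_t i = 0; i < tw; i++)
            out[j * tw + i] =
                (temp[j * tw + i] + temp[(j + 1) * tw + i] +
                 temp[(j + 2) * tw + i]) / 3.0f;
}
```

Both loop nests keep i as the inner loop, which is what makes each phase walk memory in row-major order. The full pass over temp between the two phases is where the extra memory traffic comes from.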