Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Performance Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 27 of 47

kayvonf

Question: Can someone discuss how the modified iteration order improves cache locality in this example? Specifically, how much better is it?

sgbowen

The modified iteration order decreases the amount of capacity misses. In the previous order, a capacity miss occurred once in every 10 (output) elements processed (when moving to each new row), and, on average, three cold misses occurred on each row (necessary to load the row below the current row into the cache for the first time). In the new iteration order, there will be only two cold misses and no capacity misses with each row, but the rows are only 6 output elements long now. That's 20 cold misses per 60 output elements. However the old iteration had 24 misses in 60 output elements. So the second iteration order only has 5/6 the number of misses as the previous iteration order. These calculations aren't perfect, since they missed the fact that the second block column only has 4 output elements in a row, but it gives a close estimate (unless I did something wrong).

This iteration order is a little better, but we still suffer from lots of inevitable cold misses.

sanchuah

I saw a similar trick in Microsoft's project Adam, a distributed deep learning system.