Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Perf Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 47 of 70

Back to Lecture Thumbnails

BensonQiu

This is a conflict miss. We can solve this by changing the data access pattern in our application.

kayvonf

@BensonQiu. Are you sure it's a conflict miss?

BensonQiu

@kayvonf: Would it be a capacity miss, since the working set is larger than the cache size? I was a bit thrown off because we can reduce cache misses in this example by changing our data access pattern (which corresponds to conflict misses in lecture 7, slide 041).

kayvonf

@BensonQiu. Correct, this is definitely a capacity miss because the working set is three rows of the grid, and three rows exceed the capacity of the cache. Reordering (as shown on the next slide) allows data in the cache to be re-used before it is evicted due to insufficient cache capacity. In other words, the reordering reduces what we call the reuse distance between successive accesses to the same value, so that less than a cache's amount of data is accessed in between... avoiding the capacity miss.

I do see the source of your confusion though based on that slide. In a conflict miss scenario, reuse distance might be very low, but in between two accesses to the same address X, access to another address Y caused eviction of X from the cache. Tricks to eliminate conflict misses (without changing the associativity of the cache) include: changing the data layout in memory to avoid X and Y mapping to the same cache line, changing program data access order to avoid the interjecting access to 'Y', etc.

xingdaz

@kayvonf, can you clarify how is the "three lines for every four elements" derived? Similarly "three lines for every eight elements" for next slide? Thanks!

kayvonf

@xingdaz: Given our implementation up to this point, the grid is processed in row-major order. So think about a large grid (N large) and computing element of a single row. For each four elements of this row, the system will load:

The cache line containing the four elements in the row
The cache line containing the four elements in the row above
The cache line containing the four elements in the row below

Therefore, for each four elements of output, three cache lines will be loaded. Now consider the blocked access pattern on the next slide. I claim there's twice as much data reuse.

cmusam

conflict miss (thrashing): a cache is repeatedly loading and evicting

[solution] change cache associativity; map X and Y to different cache lines

capacity miss: working set is larger than cache

[solution] increase cache size; reduce working set size