Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

jhibshma

Instead of keeping a flag bit for each processor, could we keep a counter of how many processors are accessing a line, plus one bit for whether a processor is modifying the line? This would use 1 + log p bits. What downsides/shortcomings does this approach have? I suppose this means you can't specialize sending an "I'm modifying this line" message to specific processors. Entering the shared state would be easy though, and so would checking if you could enter an exclusive read state.

CaptainBlueBear

@jhibshma that seems like it would negate the main advantage of directory-based cache coherence: minimizing communication overhead by not needing to communicate with all other caches. We need to know which processors specifically have access to a line so we can send communicate with them directly (see slides 10 to 20)

Josephus

Of course, lower values of P mean lower overhead. For example, 4 nodes means less than 1% overhead (w/ 64-byte cache lines), so a full bit vector directory should be fine for a quad-core PC.

msfernan

How do you calculate the overhead of a cache?

Josephus

@msfernan Divide P, the number of bits in a directory bit vector, by the number of bits in a cache line. For example, 64/512 = 0.125, 1024/512 = 2.00, and 4/512 = 0.0078125.

msfernan

@Josephus Thank you. I should have been clearer. I just wanted some intuition about why that is the case?

bmperez

@msfernan Well, the overhead is defined as the additional space required to store the directory information. Typically, this is expressed as a percentage, so as to be more meaningful and generalizable. Each cache line is uniformly sized, so we only need to consider one cache line to calculate the overhead.

The original size of the cache line is simply the amount of data stored in it (64 bytes = 512 bits here). Note that we ignoring the other information in the cache line (e.g. tag bits, dirty bit, etc.) and just considering the overhead with respect to the data. In a full bit vector representation, we need P bits for the directory entry, one per processor in the system to indicate if it is sharing the line, where P is the total number of processors. Thus, the overhead will be:

$$\frac{P}{512} * 100\%$$