HLAHat

So, if there's a shared L3 cache, would it be a good move to just keep the data there when all the processes will be writing to it frequently? The access latency would be longer, but could that still be faster than dealing with lots of artifactual communication? Or do the other solutions just always work better?
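
For concreteness, here's a rough sketch (my own, not from the lecture) of the kind of artifactual communication I'm worried about: each thread only writes its own counter, but because the counters sit on the same cache line, every write invalidates the line in the other cores' caches.

    #include <thread>
    #include <vector>

    constexpr int kNumThreads = 4;
    constexpr int kIters = 1 << 20;

    // All four counters fit in one 64-byte cache line, so each write by one
    // thread invalidates the line in the other threads' caches (false sharing).
    int counters[kNumThreads];

    // Padding each counter to its own cache line removes the artifactual
    // communication without changing the algorithm.
    struct alignas(64) PaddedCounter { int value; };
    PaddedCounter padded[kNumThreads];

    void work(int tid) {
        for (int i = 0; i < kIters; i++) {
            counters[tid]++;         // lots of coherence traffic
            // padded[tid].value++;  // essentially none
        }
    }

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < kNumThreads; t++) threads.emplace_back(work, t);
        for (auto& th : threads) th.join();
        return 0;
    }

My question is whether just serving that kind of data from the shared L3 could ever be cheaper than letting the line bounce between coherent L1s.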

rbcarlso

I suppose you could write a program whose performance would be better on a machine without cache-to-cache transfers, but the key thing to remember is that the programmer has no control over how the cache works, so the hardware design shouldn't assume anything about the program that's going to be run. Just keeping everything in L3 may help some programs, but it will hurt a better-written program that implements the same algorithm. I would say that in general cache-to-cache transfers are preferable because they offer more potential for performance.

fgomezfr

@HLAHat In general, I think you want to design code to avoid these situations. On a fast CPU, where instruction execution is even faster than cache access, you may find that a sequential implementation running out of the L1 cache is faster than many cores hitting L3 for every read/write.

However, it's worth noting that GPUs actually implement a model much like what you've described, in order to avoid the cost of maintaining coherence over many separate caches. NVIDIA GPUs typically have a separate L1 cache for each SM (which is NOT coherent) and a shared L2 cache (which is always coherent, because it is shared by all cores). When coherent access to global memory is essential, CUDA programmers will bypass the L1 cache (for instance by using the "volatile" keyword) and serve reads directly out of L2. Here's a Stack Overflow post with an example of this.
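
I can't reproduce that post here, but a minimal sketch of the pattern might look something like the following (the kernel and variable names are my own, not the post's): one block produces a value and sets a flag, another block polls the flag, and "volatile" keeps those accesses from being served out of registers or the non-coherent L1.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Sketch of L1-bypassing communication between blocks. Declaring the
    // pointers volatile prevents the compiler from caching the values in
    // registers and forces each access out past the (non-coherent) per-SM L1,
    // so the consumer block eventually observes the producer's writes.
    __global__ void producer_consumer(volatile int* flag, volatile int* data) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *data = 42;
            __threadfence();       // make the data write visible device-wide
            *flag = 1;             // then publish the flag
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (*flag == 0) {}  // spin until the producer's flag is visible
            __threadfence();       // order the flag read before the data read
            printf("consumer read %d\n", *data);
        }
    }

    int main() {
        int *flag, *data;
        cudaMalloc(&flag, sizeof(int));
        cudaMalloc(&data, sizeof(int));
        cudaMemset(flag, 0, sizeof(int));
        // Two small blocks so both are resident at once: block 0 produces,
        // block 1 consumes. Real code would use atomics or cooperative groups.
        producer_consumer<<<2, 32>>>(flag, data);
        cudaDeviceSynchronize();
        cudaFree(flag);
        cudaFree(data);
        return 0;
    }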

On a CPU it might turn out that other approaches generally work better, but since GPUs rely on massive concurrency to hide latency, this is a case where slightly higher latency can be more affordable than maintaining coherence.

kayvonf

For clarity: we shouldn't say a cache is coherent if it is a single shared cache serving all cores. Coherence is not an issue in this situation: since there are not multiple caches, replication cannot occur, and the coherence problem does not exist.

@HLAHat: I like your thinking, but I believe you've tricked yourself. In a system with multiple L1s and then a shared LLC (last-level cache, e.g., the L3 in your example), invalidating the data in an L1 would not invalidate it in the LLC! So even "keeping the data" in the LLC would still require it to be moved into an L1 whenever a processor accesses it. You haven't saved anything.

However, if your question is simply asking whether the overhead of coherence could be so high that it's better to build a single far-away cache than a set of distributed but coherent local ones (i.e., avoiding the need to execute a coherence protocol), then I'd say yes, I'm sure one could concoct an application scenario that might prefer that type of design.

HLAHat

Ok. I just didn't know if it would be practical in today's processors, and it seems like programmers have no control over the cache's internal mechanisms anyway. However, as @fgomezfr mentioned, GPUs can do something like this because their per-SM L1 caches aren't kept coherent and coherent accesses are served out of the shared L2. So apparently graphics processors do use this set-up (sort of)! It's just that CPUs don't typically need it, and it's usually better to just change your program to get the performance benefits.