Slide 32 of 35
Xelblade

In the first case, many counters lie on the same cache line, since ints are only 4 bytes and the array is contiguous in memory. This causes heavy cache coherence traffic: a single load brings in counters that are being modified by other threads, so every write by one thread invalidates the line in the other threads' caches.

In the second case, we pad each counter with junk bytes up to 64 bytes (one cache line) so that no two counters share a line, trading memory for reduced communication.

The problem in the first case is called false sharing, as explained further on the next slide.

gbarboza

It appears that GCC's aligned attribute is the way to do this.

See this article.

kayvonf

Question: Could someone elaborate on @Xelblade's statement about "lots of cache coherence issues?"

gbarboza

If we had int myCounter[NUM_THREADS], we could reason that the entire array fits into one cache line (with 4-byte ints and 64-byte lines, that holds for up to 16 threads if the array is line-aligned). So when each processor goes to update its corresponding element of the array, it first pulls the line into its L1 or L2 cache. Since each processor has its own L1 and L2, each processor ends up modifying a cache-local copy of myCounter.

The issue that arises from this is that when one processor writes to its unique element of myCounter, it causes all the other processors' cache-local copies of myCounter to become stale (because they all share the same cache line!). Therefore, the modifying processor has to push its changes back to main memory, and all the other processors have to pull in fresh copies from main memory before they can make their own changes to the array.

kayvonf

@gbarboza: You got it. One clarification I'll make is that in MSI or MESI, the line would go all the way out to memory and then back to the requesting processor, as you mention. But in MESIF or MOESI, cache-to-cache transfers would eliminate some of these memory transactions.

Nonetheless, the issue at hand is exactly what you state: the line is frequently being invalidated in processor caches due to writes by other processors to different data on the same cache line. This is false sharing!

markwongsk

Question: In assignment 2 we used a similar array to indicate whether or not a circle was relevant (the array of booleans was shared among threads). However, this didn't cause issues because we were using GPUs, which do not implement cache coherence, right?