Cake

In the first example, elements accessed by different threads happened to be stored in the same cache line. Therefore, between writes, the modified cache line had to be flushed before the corresponding line in another core's cache could transition from the invalid to the modified state.

However, in the second example, every thread accessed memory on a different cache line (you can see this from the padding fields), so no modified-to-invalid transitions were required, drastically improving the program's performance.

This demonstrates that sometimes, "saving" memory is not the best thing for performance. (It also shows that you should know your hardware well when trying to optimize code.)
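A minimal sketch of the two layouts being compared, assuming 64-byte cache lines (the names kCacheLineSize, padded_counter_t, etc. are illustrative, not the slide's actual code):

    #include <cstddef>

    constexpr std::size_t kCacheLineSize = 64;
    constexpr int kNumThreads = 4;

    // Version 1: counters packed together. Several counters share one
    // cache line, so a write by any thread invalidates that line in
    // every other core's cache (false sharing).
    volatile int counters[kNumThreads];

    // Version 2: each counter padded out to a full cache line, so a
    // thread writing its own counter never touches a line another
    // thread is using.
    struct padded_counter_t {
        volatile int counter;
        char pad[kCacheLineSize - sizeof(int)];
    };
    padded_counter_t padded[kNumThreads];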

tkli

This is very interesting behavior. I wonder how these issues are handled in practice?

dasteere

I wonder if it would be possible to make this inherent in the language. Many programmers could easily make this mistake, so if the language handled it automatically, it would make a lot of naively written code faster.
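For what it's worth, later C++ standards moved a little in this direction: C++17 added std::hardware_destructive_interference_size in <new> as a portable hint for exactly this kind of padding. A sketch, assuming a compiler that actually implements it (the struct name is illustrative):

    #include <new>

    // alignas rounds both the alignment and the size of the struct up
    // to the implementation's suggested false-sharing distance, so
    // array elements land on separate cache lines automatically.
    struct alignas(std::hardware_destructive_interference_size) aligned_counter_t {
        volatile int counter;
    };

It still is not automatic, though; the programmer has to know to ask for it.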

ayy_lmao

What is the purpose of the volatile qualifier on counter within the worker thread? My basic understanding of the qualifier from Java is that it requires the variable never to be loaded into cache, which seems not to be the case in this example, as that would make the execution times of test1 and test2 the same.

locked

In general, OS-related programs often involve counters that are frequently updated. If counters owned by different threads are mapped to the same cache block, then each time one thread accesses its counter, the copies of that block holding other threads' counters are invalidated, causing false sharing. This can be solved by padding each counter to fill an entire cache block, but that wastes some memory. Another way to optimize is: if one thread owns several counters, group all the counters that belong to that thread into an array and pad that array instead, as sketched below.
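A sketch of that grouping idea in C++ (assuming 64-byte lines; names are illustrative):

    #include <cstddef>

    constexpr std::size_t kCacheLineSize = 64;
    constexpr int kCountersPerThread = 8;

    // All counters owned by one thread grouped together. Only the
    // group as a whole is aligned (and thus padded) to a line
    // boundary, so the padding overhead is amortized over
    // kCountersPerThread counters instead of paid per counter.
    struct alignas(kCacheLineSize) thread_counters_t {
        int counters[kCountersPerThread];
    };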

-o4

@ayy_lmao I'm not sure about the caching behavior, but to my understanding, volatile simply tells the compiler not to optimize around this variable.

And to quote from this link:

2) Global variables within a multi-threaded application: There are multiple ways for threads communication, viz, message passing, shared memory, mail boxes, etc. A global variable is weak form of shared memory. When two threads sharing information via global variable, they need to be qualified with volatile. Since threads run asynchronously, any update of global variable due to one thread should be fetched freshly by another consumer thread. Compiler can read the global variable and can place them in temporary variable of current thread context. To nullify the effect of compiler optimizations, such global variables to be qualified as volatile.
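To tie this back to @ayy_lmao's question: in C/C++ (unlike Java's memory-model meaning), volatile only constrains the compiler; it must not cache the value in a register or elide its loads and stores. It does not bypass the hardware caches, which the coherence protocol keeps consistent anyway, which is why test1 and test2 can still behave so differently. A minimal sketch of what it does guarantee (illustrative names; modern code would use std::atomic instead):

    // Without volatile, the compiler may load 'done' once into a
    // register and compile this into an infinite spin on a stale copy.
    // With volatile, every iteration re-reads 'done' from memory.
    volatile bool done = false;

    void spin_wait() {
        while (!done) {
            // spin; each test performs a fresh load of 'done'
        }
    }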

anonymous

This mechanism makes me think of the "padding" used for memory alignment in computer graphics.

sandeep6189

One interesting aspect of writing parallel, cache-coherence-aware code is that it is tough to make portable, since it depends on hardware-specific details like the cache line size.
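One small mitigation is to query the line size at runtime instead of hard-coding it. A Linux/glibc sketch (sysconf with _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not portable POSIX, and may return 0 or -1 where unavailable):

    #include <unistd.h>
    #include <cstdio>

    int main() {
        // Ask the C library for the L1 data cache line size rather
        // than baking in a constant like 64.
        long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        std::printf("cache line size: %ld bytes\n", line_size);
        return 0;
    }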

pk267

This could also be a good example of a time vs. space tradeoff: a greater amount of memory is used (one entire cache line for every element of the padded array; with 64-byte lines, that is 64 bytes per 4-byte counter, a 16x overhead), but we get lower execution times by eliminating the coherence misses that false sharing caused on every access.

life

For the version on the left, the counter array is very likely to fit into one cache line. When different threads running on different cores write to this array, false sharing happens. However, for the version on the right, each padding_t is guaranteed to sit on a different cache line, so false sharing is avoided.

taz

I agree with @dasteere that the padding implementation takes the programmer a lot more effort and knowledge about the caches. It's interesting how unintuitive and messy the padding implementation is, but that it makes the code 3 times as fast.

ctabrizi

If our cache line is big enough, we could avoid the memory cost of this optimization, in exchange for more complex code, by putting (CACHE_LINE_SIZE / sizeof(int)) ints inside each struct, instead of having one int per struct and filling the rest of the cache line with padding; something like the sketch below.
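A sketch of that layout, assuming 64-byte lines (line_t is a made-up name):

    #include <cstddef>

    constexpr std::size_t kCacheLineSize = 64;

    // Every byte of the line now holds useful data instead of padding:
    // each thread owns one line_t and indexes within its own struct,
    // so no two threads write the same line and no memory is wasted.
    struct line_t {
        int values[kCacheLineSize / sizeof(int)];  // 16 ints per 64-byte line
    };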

I agree with the general sentiment that there seems to be a lot of specific knowledge that can really benefit programmers trying to get the most performance out of their code.