Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

TomoA

In this code, we want each thread to write to a separate memory location, so we don't want any communication between each thread. However, when a thread loads from/stores to an index into myPerThreadCounter, that thread will load the entire cache line. Thus the cache in one thread will also load in the memory that another thread will write to even though they are writing to different memory locations, and we will have to go through the different states described in previous slides and invalidate a thread's cache for the sake of coherence, despite the fact that each thread is independent.

The second code is better because we add enough padding to put each thread's index in the array on a different cache line, so there is no interference from other threads loading the line into its own cache.

rds

How do we know that the address of the struct is aligned to cache line size ? To clarify, how do we know that each PerThreadState is entirely placed in one cache line and not split amongst two ?

memebryant

@rds, compilers for systems languages typically have some way to specify alignment. For example, GCC has type attributes that solve this.

aeu

Are there ways compilers try to predict and adjust these paddings automatically?

memebryant

@aeu, a compiler probably wouldn't be allowed to make changes to the padding like this for a language like C where structs have a defined layout. In general, I don't know any languages that will try to automatically avoid false sharing at the compiler level, but I'm not an expert. It's plausible that languages with a focus on parallelism could support this, but this seems unlikely for a few reasons. I'm tired of typing on my phone so I'll leave them as an exercise for the reader.

cyl

It looks like here is a trade off between cache locality and false sharing here. For reading only, I think the padding has bad effect. For example, If we pad the memory like this, when we want to traverse the memory from pthreadState[0] to pthreadState[1000], we have to read 1000 lines from memory, while we just have to read few lines from memory without padding. I think the better way would be using the assigning chuck size for each thread but not padding, let each thread in charge for it's private chuck size. In this way, there won't be false sharing and the locality is preserved.

efficiens

Padding is bad when the only op is reading.

jsunseri

@rds You can also ensure alignment in general by allocating CACHE_LINE_SIZE-1 extra bytes and then actually start to store stuff at the first address that is a multiple of the cache line size, or using posix_memalign if you're using malloc.