Slide 26 of 44
lilli

This additional while loop lets us spin on the locally cached copy of the lock, so after the initial miss we send no further BusRd requests to other nodes while waiting.

yeq

This implementation is more efficient because, most of the time, while (*lock != 0); is just a repeated read of a cache line the current processor already holds, so it does not trigger invalidation of cache lines in other processors.

chenboy

After the first cache miss, while (*lock != 0); causes no bus communication until another processor modifies the value, because it reads the value directly from its own cache, which is desirable.

kapalani

There's a subtle point on this slide that I hadn't thought about before: for correctness, the lock variable must not be register allocated, so that all processors see updates to it rather than a stale value cached in a register. Since registers are a sort of cache in front of the L1 cache itself, is there some form of "register coherence" (not sure if this is the right term, since each thread has its own independent set of registers and the same variable can live in different registers in different threads)? Or is this achieved just by declaring the lock variable volatile, forcing the compiler to generate code that translates every read of lock into a load from memory rather than a read from a register?

rootB

It reduces bus contention compared to the previous version. t&s is a write operation: it issues a BusRdX and forces other processors to drop the cache line containing the lock variable every time a processor tries to obtain the lock. Now the spin is a read operation, so multiple processors are allowed to share the cache line, and they only attempt t&s after the lock is released.

lfragago

The problem with calling test_and_set in the loop is that we always issue a write (regardless of whether we actually need to change the lock), which causes traffic across the memory bus. By first checking whether the lock is 0, we just read the processor's local cached copy and avoid issuing unnecessary BusRdX transactions on the bus.

shpeefps

In 15-410 our lock implementation was of the sort:

while(test_and_set(*lock) != 0) yield(the_lock_owner);

return;

This was done so that the owner of the lock could get out of the critical section sooner, so that other processes waiting on the lock would get access. Now, if we combined the ideas from 15-410 and this slide, we would have:

while(*lock != 0) yield(the_lock_owner);

if(test_and_set(*lock) == 0) return;

Is there any reason we would not want to combine the two aforementioned ideas? Is there anything that could go wrong with this new implementation?

sadkins

Even if an atomic compare-and-swap doesn't succeed in writing a value, the cache still treats it as a write, so the line is invalidated. By spinning on the lock value, we keep the line in the shared state until it is invalidated by the thread that holds the lock. Therefore we compare-and-swap at most once per unlock/lock cycle.