Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

Fine-Grained Synchronization and Lock-Free Programming

Previous | Next --- Slide 4 of 37

apk

If the lock is taken by another thread, atomicCAS spins and keeps writing the old value to *addr. The implementation on the right saves from this redundant writes.

jmc

In particular, the atomicCAS is always a write, so multiple cores trying to acquire the lock will lead to thrashing the cache line between their caches, since each one will need to invalidate the others to obtain exclusive write access for the operation. Whereas on the right, since we spin on a read, no invalidation occurs until the lock is released (when *l == 0).

nemo

The implementation on the right is 'Test and Test and Set lock' as saw in the previous lecture. It is more scalable, generates less interconnect traffic, but (slightly) increases the latency in acquiring the lock.

shhhh

Ignoring the fact that atomicCAS is always a write, the second implementation is still more efficient because of the atomic action alone. Atomic actions are costly while a check on the lock is cheap, so it is more efficient to perform atomic actions only when necessary.

bjay

http://code.metager.de/source/xref/gnu/glibc/sysdeps/unix/sysv/linux/x86_64/lowlevellock.h

Looking at the ASM implementations of locking low-level locks, we can see that each snippet does in fact use the x86 instruction cmpxchg!

kayvonf

@bjay. Well, I wasn't lying to you. ;-)