If the lock is taken by another thread, atomicCAS spins and keeps writing the old value back to *addr. The implementation on the right avoids these redundant writes.
In particular, atomicCAS is always a write, so multiple cores trying to acquire the lock will thrash the cache line between their caches: each core must invalidate the others' copies to gain exclusive write access for the operation. On the right, by contrast, we spin on a plain read, so no invalidation occurs until the lock is released (when *l == 0).
The implementation on the right is the 'Test and Test and Set' lock, as seen in the previous lecture. It is more scalable and generates less interconnect traffic, but it (slightly) increases the latency of acquiring the lock.
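To make the two variants concrete, here is a minimal CPU-side sketch using C11 atomics (the function names lock_tas, lock_ttas, and unlock are mine, not from the slides; the CUDA versions use atomicCAS, but the structure is the same):

```c
#include <stdatomic.h>

typedef atomic_int lock_t; /* 0 = free, 1 = held */

/* Test-and-set: every spin iteration is a read-modify-write,
   so waiting cores keep pulling the cache line into exclusive
   state, invalidating each other (the "left" implementation). */
void lock_tas(lock_t *l) {
    while (atomic_exchange(l, 1) == 1)
        ; /* each exchange is a write, even when the lock is held */
}

/* Test-and-test-and-set: spin on a plain read, so the line stays
   shared across the waiters; only attempt the atomic write once
   the lock looks free (the "right" implementation). */
void lock_ttas(lock_t *l) {
    for (;;) {
        while (atomic_load(l) == 1)
            ; /* read-only spin: no invalidation traffic */
        int expected = 0;
        if (atomic_compare_exchange_weak(l, &expected, 1))
            return; /* CAS succeeded: we hold the lock */
    }
}

void unlock(lock_t *l) {
    atomic_store(l, 0); /* release: wakes the read spinners */
}
```

The extra load before the CAS is exactly where the small added acquisition latency comes from: an uncontended acquire now costs a read plus a CAS instead of a single CAS.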
Even ignoring the fact that atomicCAS is always a write, the second implementation is more efficient because of the cost of the atomic operation itself: atomic operations are expensive, while a plain check of the lock is cheap, so it pays to perform the atomic only when it is likely to succeed.
Looking at the assembly for these low-level lock implementations, we can see that each snippet does in fact use the x86 cmpxchg instruction!
@bjay. Well, I wasn't lying to you. ;-)