Previous | Next --- Slide 27 of 35
Back to Lecture Thumbnails

What makes cmpxchg16b hard to implement? What prevents us from having cmpxchg instructions of arbitrary sizes?


@0xc0ffee Think about it this way: what prevents you from having cmp between instructions of arbitrary sizes? Why not 64bytes? Why not 128bytes? Well because the CPU can't do that in a single instruction. So, trying to do this for two 64-bit values, would mean that the thread ends up stealing the resource for more than one cycle, and it needs to make sure that no other thread takes it back. Imagine doing this for even larger values. At some point, it'll just make more sense to do locking. Perhaps this ``point'' is actually fairly small in the real world.


I'm not sure if Intel makes it publicly available how they implement compare-and-swap on a hardware level. However, according to a quick search, it seems that a common way to do it is to utilize the MESI cache coherency protocol.

If a single proc has E access to a line, it prevents any other thread from modifying the line without sending a bus message. If we had hardware support to prevent other processors from gaining M access if a different processor has E access, it would effectively be a "lock" on the that line, which we could use to implement compare-and-swap. In any case, as long as a line is in E access, we know we can safely do any set of operations on that line atomically.

This actually helps explain why we can't do arbitrary sized swaps. Cache lines are limited in size and synchronizing over multiple cache lines doesn't work with this cache coherency based locking.


Would the existence of a CAS that compare and swaps two values help the existence of lock free data structures? As an operation it would allow insertion or deletion into a linked list atomic and fast as it would be done in one operation.

Any ideas?


@GGOda Thats an interesting question. However seeing that locks can be implemented with CAS, and that since CAS can fail, you will have to spin until the operation is successful, it almost amounts to the same effect as having a lock implemented. The program will have to stall indefinitely until the instruction succeeds, which is similar in effect to having to wait for a lock? So the performance will be similar imo


Notably this paper from 1991 reports a lock-free multiprocessor OS kernel that requires a double-word CAS and claims performance benefits due to being lock free.