Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Josephus

What defines correspondence between an LL call and an SC call? For instance, is correspondence defined for a block of code on a given thread? Does an SC call correspond to whatever the previous LL call on any thread was?

bpr

@Josephus, LL/SC may require that, within a single execution context, there are no other memory operations between the two instructions, and that both are executed on the same memory address.

lol

I believe @Josephus was talking about LL/SC semantics? Based on what I found, LL/SC depends only on the traffic (reads/writes) at a memory address for each thread, so you could potentially interleave multiple LL and SC operations on different addresses. CMIIW

bpr

@lol, the ARM guide on LDREX/STREX states that performing an LDREX to a different location will reset the monitor (i.e., cause the subsequent store to fail). Furthermore, if you think of an LL/SC pair as a transaction and allow a single processor to interleave pairs, how would the programmer handle one SC failing? Nesting of the operations is possible, although most hardware does not support it.

lol

@bpr Yeah, that makes sense. I was thinking that independent LL/SC calls could potentially be interleaved: if you perform LL1 SC1 and LL2 SC2 and they don't semantically involve each other, then you could possibly also do LL1, LL2, SC1, SC2 or something along those lines with the same effect. But since they are independent, it makes more sense just to use the first ordering (LL1 SC1, LL2 SC2).

bmperez

@lol @bpr That's correct, most hardware doesn't support nested LL/SC instructions because of how the underlying hardware is implemented. On most architectures, the LL'ed address is stored in a single special register inside the data cache, which naturally snoops on the bus traffic. Since there's only one register, you cannot nest LL/SC pairs: a second LL overwrites the address in the register, and the first address is not stored anywhere else.

Note that this is referred to as "weak" LL/SC. The "strong" version allows for nested operations. In hardware, strong LL/SC isn't practical due to the storage overhead requirements. However, there are some software implementations of strong LL/SC (referred to as LLX/SCX). According to Wikipedia, strong LL/SC is used in one of the best performing parallel implementations of a binary search tree (the paper is here).
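To make the single-register point concrete, here is a minimal sketch of weak LL/SC on one core. It is a toy model, not a description of any real processor: the class name `WeakLLSCCore`, the dictionary used as memory, and the rule that every SC attempt consumes the reservation are all assumptions made for illustration.

```python
class WeakLLSCCore:
    """Toy model of one core with a single link register (weak LL/SC)."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.link_addr = None     # the only reservation this core can hold

    def load_linked(self, addr):
        # A new LL overwrites any earlier reservation: there is only one register.
        self.link_addr = addr
        return self.memory[addr]

    def store_conditional(self, addr, value):
        # Succeed only if the reservation still names this address.
        ok = self.link_addr == addr
        if ok:
            self.memory[addr] = value
        # Assumption: any SC attempt consumes the reservation, success or not.
        self.link_addr = None
        return ok


memory = {0x10: 1, 0x20: 2}
core = WeakLLSCCore(memory)

# Properly paired: LL1 SC1, then LL2 SC2. Both stores succeed.
a = core.load_linked(0x10)
paired_first = core.store_conditional(0x10, a + 1)
b = core.load_linked(0x20)
paired_second = core.store_conditional(0x20, b + 1)

# Interleaved: LL1 LL2 SC1. The second LL replaced the reservation, so SC1 fails.
a = core.load_linked(0x10)
b = core.load_linked(0x20)
interleaved_first = core.store_conditional(0x10, a + 100)
```

In this model the paired sequence updates both locations, while the interleaved SC1 returns False and leaves memory untouched, which matches the "LDREX to a different location resets the monitor" behavior bpr describes.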

yangwu

In terms of cache coherence: multiple processors will load x into a cache line in the shared state. If one of these processors issues a BusRdX, all the others need to invalidate the cache line, and they all have to start again from the load_linked() operation next time.

rds

On a cache coherent system, store_conditional could store only if the cache line containing that address hasn't been invalidated yet (it still has exclusive or shared status). If it has been invalidated, it implies that another processor wrote to that address since the last load linked.
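The invalidation-based behavior in the two comments above can be sketched as a small simulation. This is a hedged toy model under stated assumptions: `Bus`, `Core`, `atomic_increment`, and the `interfere` hook are hypothetical names, each address is treated as its own cache line, and a write simply clears other cores' reservations instead of modeling full coherence states.

```python
class Bus:
    """Toy shared memory; a write invalidates other cores' reservations."""

    def __init__(self):
        self.memory = {}
        self.cores = []

    def write(self, writer, addr, value):
        # Stand-in for BusRdX: every other core loses its copy of the line.
        for core in self.cores:
            if core is not writer:
                core.invalidate(addr)
        self.memory[addr] = value


class Core:
    def __init__(self, bus):
        self.bus = bus
        self.link_addr = None     # single reservation, as in weak LL/SC
        bus.cores.append(self)

    def invalidate(self, addr):
        # Another core wrote this line, so our reservation is gone.
        if self.link_addr == addr:
            self.link_addr = None

    def load_linked(self, addr):
        self.link_addr = addr
        return self.bus.memory[addr]

    def store_conditional(self, addr, value):
        # Store only if the line has not been invalidated since the LL.
        ok = self.link_addr == addr
        self.link_addr = None
        if ok:
            self.bus.write(self, addr, value)
        return ok


def atomic_increment(core, addr, interfere=None):
    """Retry LL/SC until the store lands; returns the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        old = core.load_linked(addr)
        if interfere is not None and attempts == 1:
            interfere()  # another core writes between our LL and SC
        if core.store_conditional(addr, old + 1):
            return attempts


bus = Bus()
bus.memory[0x40] = 0
p0, p1 = Core(bus), Core(bus)

# P1 completes a full increment between P0's first LL and its SC.
attempts = atomic_increment(p0, 0x40, interfere=lambda: atomic_increment(p1, 0x40))
```

Here P0's first SC fails because P1's write invalidated its reservation, so P0 restarts from load_linked and succeeds on the second attempt; neither increment is lost.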

sharangc

LL/SC is more difficult to emulate than CAS. Additionally, stopping running code between paired LL/SC instructions, such as when single-stepping through code, can prevent forward progress, making debugging tricky. More information can be found here.