jmc

At an abstract level, we can think of the array-based lock as implementing a linked list of processors. With the ticket lock, each processor holds a ticket with a number and keeps checking that number against the lighted sign showing which number is currently being served. With an array-based lock, each processor has a pointer to the processor that will be served after it, and notifies that processor when its turn comes.
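
To make this concrete, here's a minimal sketch of an array-based (Anderson-style) queue lock in C11 atomics. The names (array_lock, acquire, release) and the 64-byte line size are my assumptions for illustration, not the slide's exact code:

```c
#include <stdatomic.h>

#define P    8     /* max contenders; a power of two keeps the % cheap */
#define LINE 64    /* assumed cache-line size */

typedef struct {
    atomic_int status;                    /* 0 = must wait, 1 = has lock */
    char pad[LINE - sizeof(atomic_int)];  /* one element per cache line */
} padded_int;

typedef struct {
    padded_int slot[P];    /* initialize slot[0].status = 1, rest = 0 */
    atomic_uint next;      /* next slot index to hand out */
} array_lock;

int acquire(array_lock *l) {
    int me = atomic_fetch_add(&l->next, 1) % P;
    while (atomic_load(&l->slot[me].status) == 0)
        ;                                  /* spin on my own line only */
    atomic_store(&l->slot[me].status, 0);  /* reset the slot for reuse */
    return me;                             /* caller passes this to release */
}

void release(array_lock *l, int me) {
    /* "point-to-point": touch only the next waiter's cache line */
    atomic_store(&l->slot[(me + 1) % P].status, 1);
}
```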

At an even more abstract level, this is kind of analogous to snooping vs. directory-based cache coherence. Instead of "broadcasting" to everyone when we move to the next number, we send a point-to-point message to the processor that will be served next.

rootB

@jmc, I think the ticket lock is also very similar to notifying the next processor when releasing the lock: both are FIFO, and in both cases every processor still actively spins in a while loop, checking that my_ticket equals now_serving (ticket lock) or checking the status of its own my_element (array-based lock). So it's not exactly like the broadcasting vs. point-to-point analogy.

The major difference between the two is that in the array-based lock each processor spins on a unique memory address, so when the lock is released only one cache invalidation occurs instead of the P invalidations in the ticket-lock case. This reduces network traffic at the cost of extra storage space.
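
A minimal ticket-lock sketch in the same style (field names next_ticket/now_serving as in lecture) shows why: every waiter spins on the same now_serving word, so the single store in release invalidates that line in all P waiters' caches at once.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket that currently holds the lock */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;   /* every waiter spins on the same shared location */
}

void ticket_release(ticket_lock *l) {
    /* one store, but it invalidates the line in all waiters' caches */
    atomic_fetch_add(&l->now_serving, 1);
}
```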

jmc

@rootB, you're right that in both cases every processor spins checking some flag. What I was comparing to broadcasting vs. point-to-point is the cache invalidation behavior, and its cost, when the lock is released. In the ticket lock, a release invalidates the now_serving line in all P waiters' caches; this is analogous (at an abstract level only, not in implementation) to broadcasting a message to all processors, each of which must then pay for the invalidation and a re-read. In the array-based lock, a release invalidates only one waiter's line, which is analogous to a point-to-point message (although in implementation reality no point-to-point message is necessarily sent).

I admit this was a very abstract connection, not necessarily useful, but just something that occurred to me.

pk267

Even in this case there will be some broadcast, because a set of processors with consecutive numbers will share a cache line.

Edit: Oops, my bad. Thanks @rootB for pointing it out. The lock's elements are kept on different cache lines.

rootB

@pk267, I think padded_int is designed so that elements with consecutive indexes are still located on different cache lines.
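
A quick sketch of what I mean (64-byte lines assumed; C11 alignas is one way to express the padding): aligning each element to a full line also rounds its size up to a line, so consecutive array elements cannot share one.

```c
#include <assert.h>
#include <stdalign.h>

#define LINE 64   /* assumed cache-line size */

/* alignment LINE forces sizeof(padded_int) up to LINE as well,
   so slot[i] and slot[i+1] land on distinct cache lines */
typedef struct {
    alignas(LINE) volatile int status;
} padded_int;

static_assert(sizeof(padded_int) == LINE, "one element per cache line");
```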

kapalani

Why does atomic_circ_increment have a much higher overhead than the regular atomic_increment? Can it not be implemented with just a regular atomic_increment and the % operator? Is it because of the divide instruction that would be involved if P is not a power of 2?
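
For concreteness, here's roughly what I had in mind (my own sketches, not the slide's actual definition of atomic_circ_increment):

```c
#include <stdatomic.h>

/* If P is a power of two, plain fetch-and-add plus a mask suffices:
   one atomic op, no divide. */
unsigned circ_inc_pow2(atomic_uint *ctr, unsigned P) {
    return atomic_fetch_add(ctr, 1) & (P - 1);
}

/* Keeping the stored value itself in [0, P) for arbitrary P seems to
   need a compare-and-swap loop, which can retry under contention. */
unsigned circ_inc_cas(atomic_uint *ctr, unsigned P) {
    unsigned old = atomic_load(ctr);
    while (!atomic_compare_exchange_weak(ctr, &old, (old + 1) % P))
        ;   /* a failed CAS refreshes 'old' with the current value */
    return old;
}
```

Is the CAS retry loop (or the divide) where the extra overhead comes from?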

Also, why is the storage cost linear in P? Shouldn't it be unbounded, since any number of threads running on one processor could try to acquire the lock? Or does an array-based lock require some special synchronization for a thread to detect that another thread on the same core has already attempted to take the lock, so that it should deschedule itself?

mak

Does this implementation assume there would be at most P threads (one thread per processor) contending for the lock?