In class, we talked about how a ticket lock is analogous to waiting to be served at the DMV. The critical section in this case would be the teller. If tellers simply shouted out every time they were free, then we would have a rush of people, and it would be a mess. Instead, each person takes a ticket with a number on it, and tellers increment the "now serving" sign when they are free. A person gets served only when their ticket number matches the "now serving" number. Thus, people are served in the order that they arrived (which corresponds to their ticket numbers), and only one person is served at a time. In code, each "ticket" and the "now serving" sign are integers.
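The analogy above maps directly onto code. Here is a minimal sketch of a ticket lock using C11 atomics; the function names are hypothetical, but the `next_ticket` / `now_serving` fields follow the discussion below (the exact course code may differ):

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket; // the ticket dispenser
    atomic_int now_serving; // the "now serving" sign
} ticket_lock_t;

void ticket_lock_acquire(ticket_lock_t *l) {
    // Take a ticket: the atomic increment hands each caller a unique number.
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    // Wait until the sign shows our number (a read-only spin).
    while (atomic_load(&l->now_serving) != my_ticket)
        ;
}

void ticket_lock_release(ticket_lock_t *l) {
    // The teller is free: advance the sign, admitting exactly one waiter.
    atomic_fetch_add(&l->now_serving, 1);
}
```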
yanzhan2
There is only one invalidation per lock release, but all the other processors need to read the new now_serving value; wouldn't that cause traffic? So does the O(P) traffic refer to one lock release, or to P releases?
sbly
@yanzhan The O(P) traffic refers to one lock release. There is only one invalidation, but that invalidation is seen by all P processors.
yanzhan2
@sbly, thanks. Can you also explain why there is O(P^2) traffic for the test-and-test-and-set lock, two slides before?
sbly
@yanzhan This is the worst-case scenario. It is possible that after the lock is released, all processors break out of the
while (*lock != 0);
loop before any of them perform the
test_and_set(lock)
code. In this case, all $P$ processors will execute
test_and_set(lock)
and each execution will invalidate the cached copies held by the other $P$ processors. Thus the total traffic is $O(P) \cdot O(P) = O(P^2)$.
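For contrast, here is a rough sketch of both spin-lock variants (my own illustration in C11 atomics, not the lecture's exact code), showing why every test_and_set attempt counts as a write even when it fails:

```c
#include <stdatomic.h>

// Plain test-and-set: every loop iteration performs an atomic exchange,
// i.e. a write, so the lock's cache line is invalidated in every other
// cache on every attempt, even while the lock is held by someone else.
void tas_lock(atomic_int *lock) {
    while (atomic_exchange(lock, 1) != 0)
        ;
}

// Test-and-test-and-set: spin on an ordinary load (the line stays in the
// shared state, generating no traffic), and only attempt the exchange once
// the lock looks free. The O(P^2) worst case above arises when all P
// waiters observe the release and attempt the exchange at once.
void ttas_lock(atomic_int *lock) {
    for (;;) {
        while (atomic_load(lock) != 0)
            ;
        if (atomic_exchange(lock, 1) == 0)
            return; // we won the race
    }
}

void spin_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```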
yanzhan2
@sbly, thanks, that makes sense. But my question is: when all P processors execute test_and_set, after the first one executes there is only one copy in the M state, so the later P-1 test_and_set operations do not each need to invalidate all P caches; invalidating 1 copy each is enough. So I think it is O(P) + O(P) = O(P).
sbly
It does need to invalidate all $P$ caches every time. All the caches have a copy of lock. Whenever a processor performs test_and_set(&lock), it counts as a write (even if a write does not actually happen), so the cache line becomes invalidated and the other processors must retrieve the "updated" copy (though it is actually the same value).
yanzhan2
@sbly, thanks, but I think the more accurate account is that each processor executes test-and-set and then enters the while loop, so the traffic is 1 + 2 + ... + (P-1), which is O(P^2).
tcz
I don't see why there is only one cache invalidation per lock release. During the critical section, all processors should in theory be sharing the lock's line in cache. Say processor 0 writes some value to the now_serving line. It would have to issue a BusRdX. This should invalidate everyone else's cache line, according to the MSI protocol from the consistency lecture. So, I think there would be one invalidation per processor per lock release.
moon
Why might we not use a ticket lock? It seems considerably simpler than other spinlocks and ensures fairness.
vrkrishn
If anything, the ticket lock does theoretically limit the number of processors in your system to the range of values that the ticket counter can hold.
For example, with an 8-bit ticket count variable, we can support a system with at most 256 independent cores.
vrkrishn
The main issue is that the system will map the fields of the struct to consecutive locations in memory, which will most likely fall on the same cache line. Thus, whenever a processor acquires the lock (performs an increment on next_ticket), every processor spinning on now_serving will have its cache line invalidated, leading to higher coherence traffic.
Another possible flaw is that limiting next_ticket to a b-bit number only works on a system with up to 2^b processors. For a 32-bit counter, no practical system will hit this limit for a long time, but it's still a concern for scalability.
yanzhan2
@tcz, "one invalidation" means exactly what you said. It means one invalidation message is sent from the processor that releases the lock, like a BusRdX request. So the traffic is O(P), meaning P invalidation messages are delivered to the P processors. I would say we just phrase the understanding differently, but it means the same thing.
yanzhan2
@sbly, what you described is like an update-based cache coherence protocol, while what confused me at first was the invalidation-based protocol. But the worst-case traffic is O(P^2) for the TTS lock in both cases.