Can anyone explain in details why this leads to O(P^2) interconnect traffic? And does it relate to whether the cache coherence is snooping-based or directory-based?
Not sure if the reasoning holds: for each of the P processors, there are O(P) invalidations (for invalidating copies of cache lines in other processors' caches); there are in total P processors; imagine each processor holds the lock in turn, one round for everyone holding the lock once costs O(P*P) = O(P^2). If not all processors are using the lock (they only have a copy in their cache), then interconnect traffic can be less than O(p^2). Anyways, O(p^2) is an upper bound.
P^2 is an upper bound. What may occur is that when processor 1 releases the lock (which will invalidate the line in all other processors -- O(P) interconnect traffic), every other processor is notified and attempts to obtain the lock with a test-and-set. So these P-1 test-and-sets (all treated as cache writes regardless of whether successful) will each cause all processors to invalidate, leading to P invalidations across the board --> O(P^2) interconnect traffic.
(The problem is that all the processors try to obtain the lock at once. If there were, say, a queueing system for obtaining the lock, then there wouldn't be so many invalidations. See the "ticket lock" a few slides later.)
EDIT: oops, mistakes in my terminology
I'm not sure the purpose of underlining uncontended in this slide. What's the difference when the situation is contended?
@ZhuansunXt, if uncontended, "test" before test-and-set is wasteful, since there is only one cache invalidation anyway. However, if the situation is contended, "test" is helpful as it won't require exclusive access to the cache line and thus reduce the invalidation traffic overhead due to multiple t-s attempts.
Even though there's higher latency with the extra test, I wonder whether the extra latency is even measurable. It seems like the bigger latency issue would be the brach misprediction that would occur when the lock is finally released, but we have that same problem with the original test-and-set lock anyway, and is probably dwarfed significantly by the cost of memory coherence/interconnect traffic anyway.