Question: In class, a clever (but not-to-be-publicly-named) student found a bug in my initial implementation, which is now fixed. The original version of pop() had its first two lines in the opposite order. The buggy version is below:
Node* top = s->top;
int pop_count = s->pop_count;
Can someone describe, with an example, how this buggy implementation fails?
The original slide in the lock-free data structures lecture has also been fixed.
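For readers without the slide at hand, here is a minimal sketch of what the fixed pop() presumably looks like, pieced together from the names used in this thread (top, pop_count, new_top, and a CAS over both fields). The double_compare_and_swap helper is a hypothetical, NON-atomic stand-in, so this only shows the order of the reads in a single thread; the Node and Stack layouts and the simplified push() are assumptions too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node { struct Node* next; int value; } Node;
typedef struct Stack { Node* top; int pop_count; } Stack;

/* Hypothetical stand-in for a hardware double compare-and-swap.
   NOT atomic: it only models the success/failure condition. */
static bool double_compare_and_swap(Node** top_addr, Node* old_top, Node* new_top,
                                    int* count_addr, int old_count, int new_count) {
    if (*top_addr == old_top && *count_addr == old_count) {
        *top_addr = new_top;
        *count_addr = new_count;
        return true;
    }
    return false;
}

/* Simplified single-threaded push; a real lock-free push would retry with a CAS. */
void push(Stack* s, Node* n) {
    assert(s != NULL && n != NULL);
    n->next = s->top;
    s->top = n;
}

Node* pop(Stack* s) {
    assert(s != NULL);
    while (true) {
        int pop_count = s->pop_count;  /* fixed order: read the counter first */
        Node* top = s->top;            /* the buggy version read this before pop_count */
        if (top == NULL)
            return NULL;               /* empty stack */
        Node* new_top = top->next;
        /* Succeeds only if neither top nor pop_count changed since we read them. */
        if (double_compare_and_swap(&s->top, top, new_top,
                                    &s->pop_count, pop_count, pop_count + 1))
            return top;
    }
}
```

The buggy version differs only in swapping the two reads at the start of the loop.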
Say we have two processors, $P1$ and $P2$, and the top of the stack is node $A$. In the buggy version, $P1$ executes line 1, grabbing a pointer top to $A$. Then, before it executes line 2, $P2$ changes the stack, but $A$ still ends up as the head of the stack (the ABA problem). Then $P2$ executes line 2, and gets the pop_count for the modified stack. Assuming $P2$ does nothing else in the meantime, $P1$ finishes pop() and the CAS passes because pop_count hasn't changed and $A$ is still the head of the stack. However, the stack has changed since $P1$ got a pointer to $A$, so the results are incorrect.
I still do not see why this is a problem. In @sbly's explanation, I do not understand why P2 executes line 2 after it changes the stack. If P2 changed the stack, it must have completed pop() successfully at least once, so the count has already changed. Why would P2 need to call pop() again?
Also, if A is still the head of the stack, why is the result incorrect?
@kayvonf Although I understand why the buggy implementation is considered buggy, I'm not sure I can produce a scenario where its behavior differs from the fixed version's. Because we grab new_top after we grab pop_count, I don't think we can get a situation like the one on the last slide. Can anyone produce a scenario where the two implementations behave differently?
EDIT: I actually went to @kayvonf's office hours to discuss this, and here's what we figured out:
Let's say we have the buggy implementation in play here.
Observe the following stack:
A - B - C
s->pop_count = 0
P1 enters pop() and sets its local variable top to A. Note that pop_count is still zero. Then we context switch to P2. P2 pops A (pop_count is now 1), pushes D, and pushes A back onto the stack. So the stack is now
A - D - B - C
s->pop_count = 1
We resume P1 at the line int pop_count = s->pop_count, which sets P1's local pop_count to 1. HOWEVER, P1 then executes new_top = top->next, which now IS D. The CAS then succeeds (top is still A and pop_count is still 1) and installs D as the new top, which is the correct result.
So, in conclusion, we figured out that as long as int pop_count = s->pop_count executes BEFORE we grab new_top, we can't corrupt the stack the way the last slide does.
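The interleaving above can be replayed deterministically in a single thread. The sketch below is only a simulation under stated assumptions: dcas is a hypothetical non-atomic stand-in for the double compare-and-swap, p2_interferes models P2's pop A / push D / push A as one uninterrupted step, and the context switch is hard-coded right after P1's first line. It makes one attempt with the buggy ordering and one with the fixed ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node { struct Node* next; char name; } Node;
typedef struct Stack { Node* top; int pop_count; } Stack;

/* Hypothetical, NON-atomic model of a double compare-and-swap. */
static bool dcas(Stack* s, Node* old_top, Node* new_top, int old_count) {
    if (s->top == old_top && s->pop_count == old_count) {
        s->top = new_top;
        s->pop_count = old_count + 1;
        return true;
    }
    return false;
}

/* P2's interference: pop A, push D, push A back. */
static void p2_interferes(Stack* s, Node* d) {
    Node* a = s->top;
    s->top = a->next;               /* pop A ... */
    s->pop_count++;                 /* ... which bumps pop_count */
    d->next = s->top; s->top = d;   /* push D */
    a->next = s->top; s->top = a;   /* push A back */
}

/* One attempt by P1 with the BUGGY order (top read before pop_count). */
bool p1_attempt_buggy(Stack* s, Node* d, Node** popped) {
    assert(s->top != NULL);
    Node* top = s->top;             /* line 1: top = A */
    p2_interferes(s, d);            /* context switch to P2 and back */
    int pop_count = s->pop_count;   /* line 2: already sees P2's increment */
    Node* new_top = top->next;      /* read after P2 ran, so this is D */
    *popped = top;
    return dcas(s, top, new_top, pop_count);
}

/* One attempt by P1 with the FIXED order (pop_count read before top). */
bool p1_attempt_fixed(Stack* s, Node* d, Node** popped) {
    assert(s->top != NULL);
    int pop_count = s->pop_count;   /* line 1: the value before P2 runs */
    p2_interferes(s, d);            /* context switch to P2 and back */
    Node* top = s->top;             /* A again */
    Node* new_top = top->next;      /* D */
    *popped = top;
    return dcas(s, top, new_top, pop_count);  /* stale pop_count: fails */
}
```

Starting from A - B - C with pop_count = 0, the buggy attempt succeeds on the first try and leaves D - B - C, while the fixed attempt fails its CAS on the stale pop_count and would simply retry. In this particular interleaving both orderings end with a correct stack, which matches the conclusion above; it says nothing about other interleavings.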
It would be interesting to find a situation where the buggy implementation and the non-buggy implementation actually have different results, as I mentioned before the EDIT.