This implementation essentially makes the code sequential because only one execution instance / "thread" can have access to diff at any given time. Thus, when a thread has access to diff, all of the other threads are waiting to acquire the lock and won't be performing any computation.
@MaxFlowMinCut: Just to clarify. The lock only makes portion of the code that updates the diff variable run sequentially. The threads can still run in parallel when they are processing the update of the grid cell.
I think the key bottleneck is due to frequent synchronization. As we implemented mutex in OS, it is non-trivial to lock/unlock a mutex, and contention can be an issue if there are many threads grabbing / releasing a mutex, which is likely to happen in this case.