Just to rehash the benefit here:
Now each thread will keep a local partial sum (myDiff) which they increment in the for loop. This can be done in parallel without synchronizing with other threads. Then, when the partial sum is done, it uses locks to add to the global "diff".
This reduces the amount of sequential computation in the code because now the threads are more independent and so it can perform better.
Question: Imagine this program was being run on a big supercomputer with 100,000 processors (NUM_PROCESSORS = 100,000). How might you modify the implementation of the reduction to avoid performance loss due to too much serial execution?
Instead of doing the reduction serially you could possibly again do the reduction in parallel. You could say values processor 1 will handle the first 1000 processor values processor 2 the next 1000 and so on. Then these 100 processors could broadcast their own sum values to each other and calculate the sums and then rebroadcast the final sums to all the processors. The exact mapping of which processor becomes the reducer of which group of processors may depend on the architecture and chip design.
As an addition to previous discussion, I want to add that if the implementation uses a message-passing method, then following can be done: