Now each thread will keep a local partial sum (myDiff) which they increment in the for loop. This can be done in parallel without synchronizing with other threads. Then, when the partial sum is done, it uses locks to add to the global "diff".

This reduces the amount of sequential computation in the code because now the threads are more independent and so it can perform better.

kayvonf

Question: Imagine this program was being run on a big supercomputer with 100,000 processors (NUM_PROCESSORS = 100,000). How might you modify the implementation of the reduction to avoid performance loss due to too much serial execution?

Perpendicular

Instead of doing the reduction serially you could possibly again do the reduction in parallel. You could say values processor 1 will handle the first 1000 processor values processor 2 the next 1000 and so on. Then these 100 processors could broadcast their own sum values to each other and calculate the sums and then rebroadcast the final sums to all the processors. The exact mapping of which processor becomes the reducer of which group of processors may depend on the architecture and chip design.

yimmyz

As an addition to previous discussion, I want to add that if the implementation uses a message-passing method, then following can be done:

For any process with $pid \mod 1000 \neq 0$, broadcast the partial sum to the highest non-greater pid that is multiple of 1000.

For any process with $pid \neq 0$ and $pid \mod 1000 = 0$, wait for the messages from $(pid + 1)$ up to $(pid + 999)$ containing their partial sums. After all arrive, sum them up and report to pid 0;

For pid = 0, sum up the partial sums reported from pid 1000, 2000, ..., 99000. This is the final sum.

Just to rehash the benefit here:

Now each thread will keep a local partial sum (

`myDiff`

) which they increment in the for loop. This can be done in parallel without synchronizing with other threads. Then, when the partial sum is done, it uses locks to add to the global "diff".This reduces the amount of sequential computation in the code because now the threads are more independent and so it can perform better.

Question:Imagine this program was being run on a big supercomputer with 100,000 processors (`NUM_PROCESSORS`

= 100,000). How might you modify the implementation of the reduction to avoid performance loss due to too much serial execution?Instead of doing the reduction serially you could possibly again do the reduction in parallel. You could say values processor 1 will handle the first 1000 processor values processor 2 the next 1000 and so on. Then these 100 processors could broadcast their own sum values to each other and calculate the sums and then rebroadcast the final sums to all the processors. The exact mapping of which processor becomes the reducer of which group of processors may depend on the architecture and chip design.

As an addition to previous discussion, I want to add that if the implementation uses a message-passing method, then following can be done: