Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

Previous | Next --- Slide 41 of 48

kayvonf

Question: can someone explain to me why this implementation is likely much more efficient than the one on the previous slide?

lilli

The implementation on the previous slide locks the shared myDiff variable after each (i,j) computation is achieved by each worker.

This implementation improves on that by taking advantage of the fact that a single worker can save intermediate results in a variable accessible only to that worker, and then update the shared sum with its own intermediate sum after it has processed all of its cells. There is now only one lock per stage of while loop per worker.

This is possible because the sum operation here is associative.

BestBunny

Just to add on to @lilli's answer, the code on the previous slide essentially created sequential execution because only one modification to 'diff' by any one thread could be made at any given time (resulting from the process of locking on each iteration of the inner for loops). This code on the other hand, utilizes the threads to parallelize the intended computation by locking only once per iteration of the while loop in order to combine results with the other threads and check convergence (inherently sequential dependency). The difference between the two slides is similar to the difference between the ISPC sum code that used a uniform sum versus the code that used partials.

chenboy

In the previous slides the critical section is in the innermost loop, meaning there is a portion of operations we cannot accelerate by parallelization in that loop. What makes it worse is we have to undertake the huge overhead of acquiring and releasing locks each iteration. On the other hand, if we use a partial sum, then the whole inner loop can be parallelize. The length of critical section is very short in this implementation, and we only need to acquire and release the lock once in the while loop.

VersaceGohan

This implementation locks access to diff once per thread whereas the previous slide locks access to diff (myMax - myMin) * red cells in that row many times per thread.

connorwa

Could we accomplish this better by having each thread store its partial sum in an array and then reduce +?