paracon

The getThreadId() call performs the same role as programIndex in the ISPC semantics: the programmer controls which rows each thread will process. Since this is a shared address space model, we need to acquire a lock before updating the diff value, which is global to all threads. Because acquiring the lock in every iteration of the nested loop is expensive, subsequent slides show a better way to do the same thing.
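
A minimal sketch of the pattern being described, using pthread primitives for concreteness (the slide uses its own LOCK/UNLOCK abstraction, and names such as diff_lock, myMinRow, and myMaxRow are assumptions, not the slide's exact code):

```c
#include <math.h>
#include <pthread.h>

extern float **A;                    // (n+2) x (n+2) grid, shared by all threads
extern int     n;
extern float   diff;                 // shared convergence accumulator
extern pthread_mutex_t diff_lock;    // protects diff

// One thread's sweep over the rows it was assigned (myMinRow..myMaxRow):
void sweep(int myMinRow, int myMaxRow) {
    for (int i = myMinRow; i <= myMaxRow; i++) {
        for (int j = 1; j <= n; j++) {
            float prev = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                              A[i][j+1] + A[i+1][j]);
            pthread_mutex_lock(&diff_lock);    // taken on every inner iteration,
            diff += fabsf(A[i][j] - prev);     // which is what makes this version
            pthread_mutex_unlock(&diff_lock);  // expensive
        }
    }
}
```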

kayvonf

Question: Is the programmer or the compiler/system responsible for mapping per-element work to workers (in this case, SPMD program instances) in this example? Contrast with the data-parallel formulation of the application on slide 34.

kayvonf

And yet another question: What would you say about the level of abstraction of the constructs used in the data-parallel version of this application vs. threads in a shared address space version? If you were a source-to-source compiler, can you see how you might translate the program on slide 34 into a shared address space version?

paracon

@kayvonf: The programmer is responsible for mapping work to workers. The programmer has used the getThreadId() call to compute the starting and ending rows that each SPMD instance will process. On slide 34, by contrast, the system is responsible for mapping the work. The difference is similar to the difference between ISPC's foreach and using programIndex + programCount.
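
For concreteness, a sketch of what that explicit mapping might look like (NUM_THREADS is an assumed constant, and the code assumes n divides evenly):

```c
// Programmer-managed assignment of rows to threads: getThreadId() plays the
// role of ISPC's programIndex, and NUM_THREADS the role of programCount.
int tid           = getThreadId();
int rowsPerThread = n / NUM_THREADS;           // assumes n % NUM_THREADS == 0
int myMinRow      = 1 + tid * rowsPerThread;   // row 0 is a boundary row
int myMaxRow      = myMinRow + rowsPerThread - 1;
// With a foreach-style construct (slide 34), the system would pick this mapping.
```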

kayvonf

Good!

paracon

The data-parallel version of this application uses a higher level of abstraction than the shared address space version. The for_all loop in the data-parallel version tells the compiler that all the red cells can be updated independently. The shared address space version is lower level, since it exposes more implementation detail about how multiple threads work on different chunks of data simultaneously, i.e., how that independence is actually exploited for parallelism.

If we consider the data-parallel constructs to be higher level and the shared address space constructs to be comparatively lower level, then a source-to-source compiler that converts the program on slide 34 into the program on this slide would be adhering to the semantics promised by the higher-level data-parallel model.

@kayvonf, I am not sure I want to call the shared-memory program an implementation and the data-parallel program an abstraction, because the shared-memory program still does not reveal all of the implementation details. Is this reasoning sound?

pdp

The first barrier ensures that every thread sees diff set to 0 before it starts its sweep, and that one thread's reset does not overwrite another thread's updates. The second barrier ensures that diff has been fully accumulated by all threads, and that no thread updates diff further, before any thread tests it for convergence. The third barrier ensures that no thread moves on and resets diff to 0 while the done variable is still being computed from the current value.
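
An annotated sketch of that structure (a paraphrase using pthread barriers, not the slide's exact code; TOLERANCE and the shared declarations are assumptions), with each barrier labeled according to the role described above:

```c
#include <pthread.h>
#define TOLERANCE 1e-3f              // assumed convergence threshold

extern float **A;
extern int     n;
extern float   diff;                 // shared accumulator
extern int     done;                 // shared flag; every thread computes the same value
extern pthread_barrier_t b;          // initialized for NUM_THREADS participants

void solve(int myMinRow, int myMaxRow) {
    while (!done) {
        diff = 0.f;                  // every thread resets the shared accumulator
        pthread_barrier_wait(&b);    // (1) no thread starts accumulating until all
                                     //     resets are done, so a late reset cannot
                                     //     wipe out another thread's contribution
        /* ... sweep over rows myMinRow..myMaxRow, adding into diff under a lock ... */
        pthread_barrier_wait(&b);    // (2) diff is fully accumulated by everyone
                                     //     before any thread tests convergence
        if (diff / (float)(n * n) < TOLERANCE)
            done = 1;
        pthread_barrier_wait(&b);    // (3) no thread resets diff for the next sweep
                                     //     while others are still reading it above
    }
}
```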

sstritte

In class, we returned to this example after reviewing race conditions on the next slide. If we did not protect the shared variable diff (with a lock, in this example), we would likely get a value of diff that is lower than it should be (see below), which could cause the loop to terminate before the solver has actually converged.

The reason we are likely to get a value that is too low is demonstrated on the next slide. Consider the possible situation where T0 reads the value of diff, then T1 reads the value of diff, then T0 calculates a new value, then T1 calculates a new value, then T0 writes its new value to diff, and then T1 writes its new value to diff. After all of this has occurred, diff reflects the update from T1 but not from T0. In other words, we missed the contribution to diff from T0, so the value of diff is lower than it should be.
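
In code form, the problem is that the update to diff is an unprotected read-modify-write (hypothetical variable names):

```c
// Without a lock, this one source line is really three steps:
//   tmp  = diff;       // T0 reads diff, then T1 reads the same old value
//   tmp += r;          // each thread adds its own contribution to the stale value
//   diff = tmp;        // T0 writes, then T1 overwrites T0's write
// so T0's contribution is lost and diff ends up smaller than it should be.
diff += fabsf(A[i][j] - prev);
```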

rrp123

This program has an inherently sequential component within the for loop, because each thread has to obtain a lock. The synchronization overhead therefore costs us a significant amount of the speedup we gain from parallelizing the program. To let the threads run faster and independently of one another, at least within the for loop, we could give each thread a local copy of the "diff" variable and accumulate into that. However, we would still need the barrier before checking whether the total "diff" is below the tolerance, since all threads must finish and sum their local values into diff to get the right total.
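
A sketch of that optimization (assuming the same shared declarations as in the earlier sketch, plus a barrier named sweep_barrier):

```c
float myDiff = 0.f;                       // private to this thread: no lock needed
for (int i = myMinRow; i <= myMaxRow; i++) {
    for (int j = 1; j <= n; j++) {
        float prev = A[i][j];
        A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j] +
                          A[i][j+1] + A[i+1][j]);
        myDiff += fabsf(A[i][j] - prev);  // accumulate locally, lock-free
    }
}
pthread_mutex_lock(&diff_lock);
diff += myDiff;                           // one locked update per thread per sweep
pthread_mutex_unlock(&diff_lock);
pthread_barrier_wait(&sweep_barrier);     // all partial sums must be in before
                                          // any thread checks convergence
```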

kapalani

I'm not sure I understand this completely, but if this program were an ISPC function, and the SPMD-style parallelism corresponded to multiple program instances mapped to vector lanes and controlled by a single instruction stream of vector instructions, would there be no need for the barriers? In other words, do we only need the barriers because we have multiple threads running on different cores, where we don't know what point in the code each thread has reached and they all access shared variables?

kayvonf

@kapalani. Your analysis is absolutely correct. (For everyone else, the code on this slide is written in an SPMD style, but it is not ISPC code, so no assumptions should be made about its implementation.)

shreeduh

A source-to-source translation from the data-parallel version to the shared address space version would require: 1) explicit mapping of work to program instances, which means using knowledge of the current thread's ID to assign it specific work; and 2) a shared-variable implementation of the "reduce-add" operation available in the data-parallel model, using locks/synchronization after identifying the dependencies. Implicit dependencies, such as the end of a forall loop, could perhaps be translated directly into barriers.
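
A rough sketch of how those pieces might line up in the translated code (assumed names; update_cell is a hypothetical helper that updates one cell and returns the absolute change):

```c
// forall over cells      ->  explicit row range derived from getThreadId()
float myPartial = 0.f;
for (int i = myMinRow; i <= myMaxRow; i++)
    for (int j = 1; j <= n; j++)
        myPartial += update_cell(A, i, j);

// reduceAdd(diff, ...)   ->  lock-protected update of the shared variable
pthread_mutex_lock(&diff_lock);
diff += myPartial;
pthread_mutex_unlock(&diff_lock);

// implicit end of forall ->  explicit barrier
pthread_barrier_wait(&sweep_barrier);
```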