Slide 18 of 56
adityamathur

By not declaring partial as "uniform", we get a separate copy of partial for each program instance. partial effectively becomes an array of values, with each instance mapped to one slot in that array. At the end of the per-instance computations, reduce_add sums the values across the array, giving the correct total.

However, this approach still depends on a single instance at the end to compute the final result, just like demos 2/3 from the intro lecture. Just a side note.

TJ

Question: Given that the foreach loop does not speed up the program by itself, does this parallel-sum program improve performance? Why or why not?

ehauser

@TJ I'm not sure I understand the question correctly. My impression is that the foreach loop does speed up the computation of partial via SIMD instructions, so the explicit parallel sum of partial should perform similarly to the foreach code. For the final computation of sum, I believe the foreach code would be the same if not better, since the compiler could emit a parallel summation that runs in logarithmic time rather than the current sequential sum.

Reading back through my answer, I'm sure I misunderstood the question, so please let me know if you would like me to clarify something.

TJ

@ehauser, thanks.

I think you got my question. Let me summarize it this way:

  1. The foreach loop stores x[i] into each execution instance's partial; "partial" must be distinct in each instance.

  2. The parallel summation should run in logarithmic time (this is where we gain performance).

fire

I am curious whether the actual ISPC implementation also uses a "tmp" array to store the partials of the instances within a gang and then a loop to sum up the final result. If so, I think this might be one reason why SIMD cannot reach ideal speedup: not every instruction can execute in parallel.