Slide 16 of 52
pwei

So I am just trying to untangle this in my head. In the program on the right, we have a uniform return value sum, and a regular old float partial. foreach then creates a bunch of instances of the loop, each with its own partial (call them partial0, partial1, ...). When the loop ends, the function reduceAdd looks for all the things called partialx and sums them to make the uniform sum. I think going from having a bunch of partials running around to having one function grab all of them and add them together is confusing to me (and perhaps I'm thinking about it wrongly).
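
For reference, the code on the slide looks roughly like this (my transcription, so the exact identifiers may differ; the slide's reduceAdd is ISPC's standard-library reduce_add):

    // Sketch of the ISPC reduction being discussed (assumed, not verbatim).
    export uniform float sumall2(uniform int N, uniform float x[])
    {
        uniform float sum;       // one copy shared by the whole gang
        float partial = 0.0f;    // a private copy in every program instance

        // Each program instance accumulates a subset of the elements
        // into its own copy of partial.
        foreach (i = 0 ... N)
        {
            partial += x[i];
        }

        // Combine every instance's partial into one uniform value.
        sum = reduce_add(partial);
        return sum;
    }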

yanzhan2

I think the difficulty is that this is a reduction, so the parallelization is not as straightforward as for something like saxpy (in Assignment 1). Suppose there are 80 elements and the SIMD width is 8: each SIMD lane then handles 10 elements, accumulating its own partial sum. That gives roughly an 8x speedup, because each SIMD execution unit is doing useful work in parallel, although the final combining step requires synchronization.
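
To make the numbers concrete, here is a rough sketch assuming an 8-wide gang (programCount == 8) and exactly 80 elements:

    // With 80 elements and an 8-wide gang, the foreach runs as 10
    // gang-wide steps; in each step the 8 program instances read 8
    // consecutive elements, so every instance adds about 10 values
    // to its own partial before the single combining step.
    export uniform float sum80(uniform float x[])
    {
        float partial = 0.0f;
        foreach (i = 0 ... 80)
            partial += x[i];         // ~10 additions per program instance
        return reduce_add(partial);  // the synchronized final step
    }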

sluck

@pwei I think what is happening here is that, since sumall2 is an ISPC function, a call to sumall2 will spawn a gang of ISPC program instances (rather than the foreach creating the instances). My guess would be that, since (in the implementation) each program instance runs on a different execution unit and executes the same instruction at the same time, each program instance has its own copy of the variable partial (since partial doesn't have the uniform modifier attached).
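
Roughly, I picture that distinction like this toy sketch (programIndex, reduce_add, and print are ISPC built-ins; the function name is made up):

    export void uniform_vs_varying_demo()
    {
        uniform float shared = 0.0f;   // one copy for the entire gang
        float perInstance = 0.0f;      // a separate copy per program instance

        perInstance += programIndex;   // each instance updates only its own copy

        // Getting back to a single uniform value requires a reduction.
        shared = reduce_add(perInstance);
        print("shared = %\n", shared);
    }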

I think of the foreach itself as an indication to ISPC that the body of the loop is safe to run in parallel, and that it is up to ISPC to assign the iterations to program instances accordingly (see slide 14).

As for the reduceAdd, there's actually a neat explanation of this code on the ISPC GitHub that says there's a single call to reduceAdd at the end of the loop. But I imagine reduceAdd requires some form of communication between the instances to compute the sum; it then stores the result in the location of sum, of which there is just one copy shared across all program instances.
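
Conceptually (just my sketch of the semantics, not how the library actually implements it, which presumably uses SIMD horizontal adds), I imagine reduceAdd behaving like this, written with ISPC's extract() and programCount built-ins:

    // Illustrative stand-in for reduce_add: gather every program
    // instance's private copy of v and sum them into one uniform value.
    uniform float my_reduce_add(float v)
    {
        uniform float total = 0.0f;
        for (uniform int lane = 0; lane < programCount; lane++)
            total += extract(v, lane);   // read lane's copy of v
        return total;
    }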

yihuaz

Every program instance is run on the same execution unit. Different threads run on different execution units.

rokhinip

@yihuaz What exactly do you mean by an execution unit? Is it an ALU on a single core? Each program instance maps to one vector lane in a SIMD vector, so the computation for each instance is performed by a single ALU.
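
For what it's worth, a tiny sketch that makes the instance-to-lane mapping visible (programIndex and programCount are ISPC built-ins; print shows every lane's value when given a varying argument):

    export void show_lanes()
    {
        // One instruction stream for the whole gang, but programIndex
        // is varying: each program instance (SIMD lane) has its own value.
        print("lane ids: % of % lanes\n", programIndex, programCount);
    }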

adsmith

I think he means the portion of the core that does the fetch/decode. In ISPC, each program instance runs on a single ALU, but there is only one fetch/decode for all the program instances. Different threads running at the same time, though, will each perform their own fetch/decode and so can run different instructions.