williamx

Does each instance sum up all N numbers in the array, or does each instance sum up a partition of the array?

jkorn

Because we have the foreach loop, each instance sums up its assigned partition of the array (decided at compile time by the ISPC compiler). The foreach syntax designates that, across all program instances, the entirety of the loop is executed exactly once (not once per instance, but once in total across the gang).
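
For reference, here is a minimal ISPC sketch of what such a function might look like (illustrative only, not the slide's exact listing; the name sumall and its parameters are made up):

    export uniform float sumall(uniform int N, uniform float x[])
    {
        float partial = 0.0f;        // varying: one copy per program instance
        foreach (i = 0 ... N) {      // iterations are divided across the gang
            partial += x[i];         // each instance accumulates its share
        }
        // combine the per-instance partials into a single uniform result
        return reduce_add(partial);
    }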

I have a quick question regarding the return type for ISPC code - it wasn't covered, but I am assuming that, just as the inputs must be uniform, the return type must always be uniform (since you can only return back to your C code once)? Thanks!

M12

How come reduce_add compiled to the iterative C code above? Is this because the underlying architecture does not have the required instructions? I realize this is up to the implementation, but the implementation should attempt to be optimal. What might the SIMD version of this be? That would parallelize the C code even more.

woohoo

@williamx each instance sums up a partition of the array, and the partial sums of the instances are then combined using reduce_add. More information about reduce_add and the ISPC standard library can be found here: https://ispc.github.io/ispc.html#the-ispc-standard-library

kshitizdange

@jkorn Your assumption about the return type is right; Prof. Kayvon did mention that the return type is uniform.

paracon

The C++ AVX implementation complies with the ISPC gang abstraction, since you have partials that each compute some part of the sum. (The way the loop is structured, it also follows the same static interleaving that ISPC implementations use.) The final sum is not computed until all of the partials have been computed, again adhering to ISPC semantics. The final loop is equivalent to what reduce_add performs.
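
To make the comparison concrete, here is a hedged sketch of the kind of AVX intrinsics code being discussed (illustrative only, assumes N is a multiple of 8; not the slide's exact code):

    #include <immintrin.h>

    float sumall_avx(int N, float* x)
    {
        __m256 partial = _mm256_setzero_ps();    // 8 per-lane partial sums
        for (int i = 0; i < N; i += 8)
            partial = _mm256_add_ps(partial, _mm256_loadu_ps(&x[i]));

        // the role of reduce_add: fold the 8 lanes into one scalar
        float lanes[8];
        _mm256_storeu_ps(lanes, partial);
        float sum = 0.0f;
        for (int j = 0; j < 8; j++)
            sum += lanes[j];
        return sum;
    }

Each vector lane plays the role of one program instance in the gang, and the final scalar loop is the part that is "equivalent to what reduce_add performs."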

cwchang

The correspondence between the intrinsics version and the ISPC version can be thought of as follows:

  1. In the intrinsics version, we use 8-wide SIMD to compute the partial sums. In the ISPC version, we will probably have 8 (or a small multiple of 8) instances working on the partial sums.
  2. In the intrinsics version, we iterate over the partial sums to get the final sum. This is similar to the reduce_add (combining) step in the ISPC version.

rootB

This explains how ISPC code is an abstraction in the SPMD style that is implemented with SIMD instructions. The code at right is the higher-level abstraction: for a program, foreach indicates that this loop needs to be run by a gang of instances concurrently. The actual implementation (the SSE/AVX instructions emitted by the ISPC compiler) is like the code below, which specifies how each instruction operates on different data, i.e., the real SIMD instructions. The implementation complies with the abstraction in that it does exactly what the ISPC code asks the compiler to do: add x[i] to a partial sum (by a gang of 8 instances, or equivalently a vector of width 8 in SIMD instructions), and then do a reduce_add to combine the per-instance partial results (each lane of the vector).
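
Regarding M12's question about what a SIMD version of the final reduction might look like: here is a speculative in-register sketch (not from the slide or the ISPC compiler's actual output) that folds the 8 lanes with shuffles instead of a scalar loop:

    #include <immintrin.h>

    // Horizontal sum of an 8-wide vector of floats.
    static inline float hsum256(__m256 v)
    {
        __m128 lo = _mm256_castps256_ps128(v);       // lanes 0..3
        __m128 hi = _mm256_extractf128_ps(v, 1);     // lanes 4..7
        __m128 s  = _mm_add_ps(lo, hi);              // 4 partial sums
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));      // 2 partial sums
        s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));  // final sum in lane 0
        return _mm_cvtss_f32(s);
    }

Whether the compiler emits something like this or a scalar loop is an implementation choice; either way it satisfies the reduce_add abstraction.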