Avesh

The key idea in the ISPC code lies in the first for loop. This loop should compute the sine of N numbers. The code cuts these N computations into chunks of size programCount, and the programCount computations within each chunk are carried out in parallel, one per program instance.

Each program instance computes the elements i of the array for which i mod programCount == programIndex.
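
For concreteness, here is a sketch of the structure being described (illustrative; variable names and the Taylor-series bookkeeping may differ slightly from the slide's code):

// sketch of the interleaved assignment; assumes N % programCount == 0
export void sinx(uniform int N, uniform int terms, uniform float* x, uniform float* result)
{
    // outer loop: each gang-wide iteration covers programCount consecutive elements,
    // one element per program instance
    for (uniform int i = 0; i < N; i += programCount)
    {
        int idx = i + programIndex;            // this instance's element for this iteration
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6;                 // 3!
        uniform int sign = -1;
        // inner loop: every instance runs all 'terms' iterations itself
        for (uniform int j = 1; j <= terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }
        result[idx] = value;
    }
}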

jmaclean

This interleaved implementation is preferred to the blocked implementation on the next slide because, when the program instances all read memory at the same time, the interleaved assignment makes those reads contiguous and therefore easy to grab in a single wide load.
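
To make the contrast with the blocked scheme concrete, here is a sketch of the two access patterns (the function names are made up for illustration, and both assume N is a multiple of programCount):

export void copy_interleaved(uniform int N, uniform float* x, uniform float* y)
{
    for (uniform int i = 0; i < N; i += programCount)
    {
        int idx = i + programIndex;          // in one iteration the gang touches x[i] .. x[i+programCount-1]:
        y[idx] = x[idx];                     // contiguous memory, a single packed vector load/store
    }
}

export void copy_blocked(uniform int N, uniform float* x, uniform float* y)
{
    uniform int chunk = N / programCount;
    for (uniform int i = 0; i < chunk; i++)
    {
        int idx = programIndex * chunk + i;  // in one iteration the gang touches elements spaced 'chunk' apart:
        y[idx] = x[idx];                     // non-contiguous memory, requiring a gather/scatter
    }
}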

nslobody

Question: Is this the same idea as a shader, e.g. GLSL?

kayvonf

Close. Graphics shading languages like GLSL or HLSL are designed to provide a more "pure" data-parallel set of abstractions. Thus a GLSL program looks a lot more like my example data-parallel/stream programming abstractions on slide 37. You write the kernel function (a.k.a., a GLSL shader), and it is executed by the graphics system once per element of an input stream of vertices/fragments/etc. There is no way to reference any data but the stream element the kernel is provided (plus some per-kernel read-only state such as textures). In fact, I almost decided to use GLSL as an example there, but was worried about confusing those who have not taken 15-462.

yingchal

Question: Can I understand the uniform modifier this way: it is like a static modifier in a Java class, where all instances of the class share the same value of the variable, while a variable without the uniform modifier in an ISPC program is like a private one that only the owning instance can access?

kayvonf

@yingchal: That's an okay (but not perfect) analogy. I'll say it another way: All program instances in an ISPC gang share a single instance of a uniform variable. This is similar to a per-thread-block shared variable in CUDA. Non-uniform variables are local to the program instance, so each ISPC program instance maintains a potentially unique value for the variable. Uniform variables go out of scope when the gang completes.
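
A tiny sketch of the distinction (the function is made up for illustration and assumes N is a multiple of programCount):

export void scale_all(uniform int N, uniform float* x, uniform float* y)
{
    uniform float scale = 2.0f;          // uniform: one copy, shared by the whole gang
    for (uniform int i = 0; i < N; i += programCount)
    {
        float v = x[i + programIndex];   // non-uniform: each instance holds its own value of v
        y[i + programIndex] = scale * v;
    }
}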

unihorn

Question: There is a nested for loop in this example. We could use the ISPC gang on variable i or on variable j; using it on both i and j at the same time is currently forbidden. Does it have the same effect either way? If not, how do we decide?

kayvonf

I'd like to see someone else try and answer @unihorn's question. There is a nested loop in this example. Which iterations (if any) are parallelized by ISPC? Which are not? Hint: think carefully!!! If you get this answer right you've got a good command of the implementation of the ISPC language.

nkindberg

I'll take a stab.

If I understand the question correctly, I think the answer to @unihorn's question is: gang instances are not assigned to loops.

The abstraction we are using here is SPMD (single program, multiple data). Thus ISPC spawns a gang of programCount instances of sinx. There is no parallelization of loop iterations per se; rather, multiple instances of the entire function run in parallel.

TeBoring

The video related to this lecture mentioned the abstraction and implementation of ISPC. I am still not quite sure about their differences. My understanding is:

ISPC abstraction: your SPMD program will run the code on a gang of program instances, and return only after all of them have finished.

ISPC implementation: a gang of program instances will be run on the SIMD lanes on the same core.

Is that right? If so, I'm still confused about why "run on the SIMD lanes of the same core" is not part of the answer to "what is the abstraction of ISPC?" (How it is implemented may have side effects, so why don't we need to include that in the abstraction?)

pebbled

@TeBoring: The notion of a "gang of program instances" is an abstraction in itself: really, these independent pieces of computation are compiled into a single stream of (4/8/...-wide SSE/AVX/...) instructions which are run sequentially on a single core. These instructions are capable of manipulating multiple pieces of data stored in a special register file (XMM/YMM on SSE4/AVX chips).

This implementation has no side effects other than those made plain by the abstraction: if multiple program instances mess with the same data, there will be race conditions. It doesn't matter whether the threads of execution are being concurrently scheduled on a single-core processor, run in parallel on multiple cores, or secretly combined into a single sequential stream of wider instructions. (Of course there are differences in performance, but optimization time is when we disabuse ourselves of this abstraction.)

kayvonf

@TeBoring: Your description of the ISPC programming model abstraction and the ISPC implementation is completely correct. However, given the semantics of the ISPC programming model, alternative implementations are also possible. For example, for the program above, a perfectly valid ISPC compiler might not emit AVX instructions and simply emit a sequential binary that executes the logic from each of the program instances one after another. The resulting compiled binary is correct (it computes what the program specifies it should compute); it's just slow. Alternatively, it would be valid for the compiler to have spawned eight pthreads, one for each program instance, to carry out the logic on eight cores.

The point is that there is nothing in the basic ISPC model that dictates this choice and the programmer need not think about the implementation if he/she simply wants to create a correct program. However, as I'm sure it's now clear in this class, the programmer will need to think about the underlying implementation if an efficient program is desired.

Now, I'm being a little loose with my comments here because there are in fact certain details in ISPC that do put constraints on the implementation. One example is the cross-instance functions (e.g., reduceAdd), which do place a scheduling constraint on execution. Since reduceAdd entails communication of values between program instances, all program instances must run up until this point in the program to determine what values to communicate. (Otherwise, it's impossible to perform the reduction!) Consider an ISPC function that contains a reduceAdd. The ISPC system cannot run instance 0 to completion before starting instance 1 because the reduceAdd operation, which must be executed as part of instance 0, depends on values computed by instance 1.
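
For example, a gang-wide sum might look like the sketch below (illustrative; note that in the current ISPC standard library the cross-instance sum kayvonf calls reduceAdd is spelled reduce_add):

// assumes N % programCount == 0
export uniform float sumall(uniform int N, uniform float* x)
{
    float partial = 0.0f;            // per-instance accumulator
    for (uniform int i = 0; i < N; i += programCount)
        partial += x[i + programIndex];
    // the reduction cannot run until every instance has produced its partial sum;
    // this is the scheduling constraint described above
    return reduce_add(partial);
}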

mmp

Clarification and corrections on @kayvonf's comment: The implementation of ISPC is actually a bit more constrained by the language semantics (and I'll argue that's a good thing). It'd be difficult to provide an implementation of ISPC that used one CPU hardware thread (think: pthread) per program instance and ran efficiently.

ISPC guarantees that program instances synchronize after each program sequence point, which is roughly after each statement. This allows writing code like:

uniform float a[programCount];              // one array shared by the whole gang
a[programIndex] = foo();                    // each instance writes its own slot
return a[programCount - 1 - programIndex];  // each instance reads a slot written by another instance

On the third line, program instances are reading values written by other program instances on the second line. This is valid ISPC, and is one of the idiomatic ways to express communication between the program instances. (See The ISPC User's Guide for details.)

So one implication of this part of the language abstraction is certainly that it constrains viable implementation options; perhaps one could implement ISPC by spawning a pthread per program instance, but it'd be necessary for the compiler to conservatively emit barriers at places where it was possible that there would be inter-program instance communication. This in turn is tricky because:

  • In a language with a C-based pointer model like ISPC, the compiler has to be pretty conservative about what a pointer may actually point to.

  • Barrier synchronization on CPUs is sort of a pain to do efficiently. (This is a challenge for CPU OpenCL implementations; see discussion in this paper for some details and further references.)

Both of these also mean that an implementation that mapped program instances one-to-one onto CPU HW threads would make it much harder for the programmer to reason about their program's performance; small changes to the input program could cause large performance deltas, just because they inadvertently confused the compiler enough about what was actually going on that it had to add another barrier. Especially for a language that is intended to provide good performance transparency to the programmer, this is undesirable.

In the end, I don't think there is any sensible implementation of an ISPC gang other than one that runs on SIMD hardware. Indeed, if you take out the foreach stuff in the language and just have the programCount/programIndex constructs, you can look at what is left as C extended with thin abstractions on top of the SIMD units, not even an explicitly parallel language at all. (In a sense.)