kayvonf

One way to test your understanding of ISPC's mapping of the SPMD programming model abstraction onto SIMD instructions is to see if you understand what the compiler will do if you set the gang size to 8 on a SIMD-width-4 machine (or a gang size of 16 on a SIMD-width-8 machine).

A student last year posted the following:


To demonstrate the differences in compiling with different --target options, namely how double-width targets are handled, I made a small test program:

export void foo(uniform int N, uniform float input[], uniform float output[])
{
  for (uniform int i = 0; i < N; i+= programCount) {
    output[i + programIndex] = 5 * input[i + programIndex] + 7;
  }
}

It's easy to find the duplicated instructions in the double-width output.

sse4:

1e3:   0f 10 14 06             movups (%rsi,%rax,1),%xmm2
1e7:   0f 59 d0                mulps  %xmm0,%xmm2
1ea:   0f 58 d1                addps  %xmm1,%xmm2
1ed:   0f 11 14 02             movups %xmm2,(%rdx,%rax,1)

sse4-x2:

333:   0f 10 14 06             movups (%rsi,%rax,1),%xmm2
337:   0f 10 5c 06 10          movups 0x10(%rsi,%rax,1),%xmm3
33c:   0f 59 d0                mulps  %xmm0,%xmm2
33f:   0f 58 d1                addps  %xmm1,%xmm2
342:   0f 59 d8                mulps  %xmm0,%xmm3
345:   0f 58 d9                addps  %xmm1,%xmm3
348:   0f 11 5c 02 10          movups %xmm3,0x10(%rdx,%rax,1)
34d:   0f 11 14 02             movups %xmm2,(%rdx,%rax,1)

The avx dumps were similar. It's interesting that the movups instructions for each group of 4 are done one after another, but the mulps and addps are grouped.


kayvonf

In class we talked about the possibility of other potential implementations of the ISPC abstraction (e.g., non-SIMD). I'll repeat my in-class point below, but make sure you read to the end to get the full story, as Matt Pharr, the creator of ISPC, stepped in to clarify my response:

Kayvon said: Given the semantics of the ISPC programming model, alternative implementations are also possible. For example, for the program above, a perfectly valid ISPC compiler might not emit AVX instructions at all and simply emit a sequential binary that executes the logic of each program instance one after the other. The resulting compiled binary is correct -- it computes what the program specifies it should compute -- it's just slow. Alternatively, it would be valid for the compiler to have spawned eight pthreads, one for each program instance, to carry out the logic on eight cores.
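As a purely illustrative sketch (my own plain C, not anything the ISPC compiler actually emits), the "sequential binary" option for the foo() example at the top of this page might conceptually look like the following, assuming a gang size of 8 and that N is a multiple of the gang size; the names foo_sequential and GANG_SIZE are made up:

#define GANG_SIZE 8   /* stands in for programCount */

/* Each program instance is just an iteration of the inner loop,
   executed one after the other -- no SIMD instructions at all. */
void foo_sequential(int N, float input[], float output[])
{
  for (int i = 0; i < N; i += GANG_SIZE) {      /* the uniform loop over chunks */
    for (int pi = 0; pi < GANG_SIZE; pi++) {    /* pi plays the role of programIndex */
      output[i + pi] = 5 * input[i + pi] + 7;
    }
  }
}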

The point is that there is nothing in the basic ISPC model that dictates this choice, and the programmer need not think about the implementation if he/she simply wants to create a correct program. However, as I'm sure is clear by now in this class, the programmer will need to think about the underlying implementation if an efficient program is desired.

Now, I'm being a little loose with my comments here because there are in fact certain details in ISPC that do put constraints on the implementation. One example is the cross-instance functions (e.g., reduceAdd), which place a scheduling constraint on execution. Since reduceAdd entails communication of values between program instances, all program instances must run up to this point in the program to determine what values to communicate. (Otherwise, it's impossible to perform the reduction!) Consider an ISPC function that contains a reduceAdd. The ISPC system cannot run instance 0 to completion before starting instance 1, because the reduceAdd operation, which must be executed as part of instance 0, depends on values computed by instance 1.
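To make that scheduling constraint concrete, here is a rough plain-C sketch (the names GANG_SIZE, per_instance_work, and gang_sum are hypothetical, and this is not what the ISPC compiler actually generates) of a gang-wide reduction: the final sum needs a partial value from every instance, so no single instance can be run to completion in isolation.

#define GANG_SIZE 8   /* stands in for programCount */

/* The work one program instance does before reaching the reduceAdd. */
static float per_instance_work(int programIndex, const float input[], int N)
{
  float partial = 0.0f;
  for (int i = programIndex; i < N; i += GANG_SIZE)
    partial += input[i];
  return partial;
}

/* The reduceAdd point: the sum requires partials from ALL instances,
   so every instance must have executed at least up to here before any
   instance can move past it. */
float gang_sum(const float input[], int N)
{
  float partial[GANG_SIZE];
  for (int pi = 0; pi < GANG_SIZE; pi++)
    partial[pi] = per_instance_work(pi, input, N);

  float sum = 0.0f;
  for (int pi = 0; pi < GANG_SIZE; pi++)
    sum += partial[pi];
  return sum;
}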


Then Matt Pharr, creator of ISPC, corrected me with the following clarification...


Clarification and corrections on @kayvonf: The implementation of ISPC is actually a bit more constrained by the language semantics (and I'll argue that's a good thing). It'd be difficult to provide an implementation of ISPC that used one CPU hardware thread (think: pthread) per program instance and ran efficiently.

ISPC guarantees that program instances synchronize after each program sequence point, which is roughly after each statement. This allows writing code like:

uniform float a[programCount];
a[programIndex] = foo();
return a[programCount - 1 - programIndex];

In the third line, program instances are reading values written by other program instances in line 2. This is valid ISPC, and is one of the idiomatic ways to express communication between the program instances. (See The ISPC User's Guide for details.)

So one implication of this part of the language abstraction is certainly that it constrains viable implementation options; perhaps one could implement ISPC by spawning a pthread per program instance, but it'd be necessary for the compiler to conservatively emit barriers at places where it was possible that there would be inter-program instance communication. This in turn is tricky because:

  • In a language with a C-based pointer model like ISPC, the compiler has to be pretty conservative about what a pointer may actually point to.

  • Barrier synchronization on CPUs is sort of a pain to do efficiently. (This is a challenge for CPU OpenCL implementations; see the discussion in this paper for some details and further references.)

Both of these also mean that an implementation that implemented program instances as one-to-one with CPU HW threads would make it much harder for the programmer to reason about their program's performance; small changes to the input program could cause large performance deltas, just because they inadvertently confused the compiler enough about what was actually going on that it had to add another barrier. Especially for a language that is intended to provide good performance transparency to the programmer, this is undesirable.
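To see why this would hurt, here is a rough sketch (my own hypothetical pthreads code, not anything the ISPC compiler produces) of the one-thread-per-instance strategy applied to the foo() kernel from the top of this page, with a conservative barrier emitted after the statement in the loop body:

#include <pthread.h>

#define GANG_SIZE 8   /* one pthread per program instance */

static pthread_barrier_t gang_barrier;

struct instance_args {
  int programIndex;
  int N;
  const float *input;
  float *output;
};

/* Body of one program instance.  The barrier after the assignment is the
   conservative "some other instance might need this value" sync point;
   it fires on every loop iteration, which is why this scheme is slow. */
static void *instance_main(void *p)
{
  struct instance_args *a = p;
  for (int i = 0; i < a->N; i += GANG_SIZE) {
    a->output[i + a->programIndex] = 5 * a->input[i + a->programIndex] + 7;
    pthread_barrier_wait(&gang_barrier);
  }
  return NULL;
}

/* Launch the "gang": one OS thread per program instance.
   (Assumes N is a multiple of GANG_SIZE, like the original example.) */
void foo_pthreads(int N, const float input[], float output[])
{
  pthread_t threads[GANG_SIZE];
  struct instance_args args[GANG_SIZE];

  pthread_barrier_init(&gang_barrier, NULL, GANG_SIZE);
  for (int pi = 0; pi < GANG_SIZE; pi++) {
    args[pi] = (struct instance_args){ pi, N, input, output };
    pthread_create(&threads[pi], NULL, instance_main, &args[pi]);
  }
  for (int pi = 0; pi < GANG_SIZE; pi++)
    pthread_join(threads[pi], NULL);
  pthread_barrier_destroy(&gang_barrier);
}

Even for this trivial kernel the barrier fires once per loop iteration, and a compiler trying to elide it would run into exactly the pointer-aliasing problem described above.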

In the end, I don't think there is any sensible implementation of an ISPC gang other than one that runs on SIMD hardware. Indeed, if you take out the foreach stuff in the language and just have the programCount/programIndex constructs, you can look at what is left as C extended with thin abstractions on top of the SIMD units, not even an explicitly parallel language at all. (In a sense.)

lixf

I have some questions/concerns about Matt Pharr's reply and hope someone could explain some details to me. Thanks in advance.

  1. He says: "ISPC guarantees that program instances synchronize after each program sequence point, which is roughly after each statement." Is this really preferable for an efficiency-driven language? I understand that this constraint makes inter-instance communication possible, but why do programmers want to synchronize in an ISPC program? For example, if I want to sum up an array like the example in class, it should be my responsibility to use an array of partial sums inside the ISPC code and to avoid synchronization. I remember us talking about how one of the "requirements" for using SIMD is the absence of dependencies among different runs of the same program. Thus, wouldn't this constraint defeat the purpose of having ISPC in the first place (for example, a "regionally" unbalanced workload would result in terrible performance)? (Or perhaps I'm missing something.)

  2. By "barrier", does he mean locks? Also, he says: "an implementation that implemented program instances as one-to-one with CPU HW threads would make it much harder for the programmer to reason about their program's performance." I don't quite understand why this is the case. If they were to use HW thread implementation, isn't it just like having pthreads? Why is pthread hard to reason?

  3. At the end, he says: "if you take out the foreach stuff in the language and just have the programCount/programIndex constructs,… not even an explicitly parallel language at all." But isn't the underlying implementation achieved by using SIMD vectors? Does he mean it only doesn't LOOK like a parallel language?

LilWaynesFather

Sorry, perhaps it was explained in lecture, but what exactly does --target change in an ISPC program? The gang size?

kayvonf

Yes, --target is a compiler flag that sets the instruction set targeted and the gang size of an ISPC program. For example, --target sse4-i32x8 says to use a gang size of 8 and emit SSE4 instructions; --target avx2-i32x16 means emit AVX2 instructions and use a gang size of 16.

Full reference is http://ispc.github.io/ispc.html#selecting-the-compilation-target.
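(As a hypothetical example, invoking the compiler with something like "ispc --target=sse4-i32x8 program.ispc -o program.o" should give you an 8-wide gang compiled to 4-wide SSE4 instructions, i.e., the doubled-up code pattern shown in the sse4-x2 dump near the top of this page; the file names here are made up.)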

iamk_d___g

@lixf: Hi,

  1. I think the efficiency of the semantics ISPC chooses depends on the workload type, as no single language is suitable for every kind of job. If ISPC is not a good fit for the work, then we shouldn't use it.

  2. Barriers are used to order instructions, preventing problems caused by the reordering of memory operations at compile time or run time. What I interpret Matt to mean here is that, since barriers are expensive to execute and a pthread-based implementation would require adding (sometimes unexpected) barriers, it is hard for programmers to reason about performance.

  3. I think he means that if only the programCount/programIndex constructs remain, ISPC doesn't even take advantage of multicores, so it's hard to call it a "parallel" language.

eatnow

@lixf: For 1., I am guessing that this type of "synchronization" is "free" and comes at no additional cost. If we had a bunch of threads, then each thread would need to wait for every other to reach the same point, hence the need for barriers. But in this case, each operation occurs across the gang before the next one begins.

It looks like increasing the gang size would work up to the point where a gang has more program instances than fit in the SIMD (xmm/ymm) registers. What happens then? Is it simply not allowed, or is ISPC able to do something fancy to ensure the "synchronize after each program sequence point" guarantee?

mmp

Lots of good questions and comments here!

Re 1: @lixf makes a fair point: that synchronization isn't generally taken advantage of by programmers--really only in cases where there is inter-program-instance communication, which is not the most common case. I think what I'd say here is that there's essentially no cost to it in practice (it's generally the natural way to compile ISPC programs to a CPU), and there is some value in not requiring explicit synchronization points in the program as written by the programmer for the cases when synchronization is needed.

Re 2: @eatnow explains it well. I was trying to make the point that, given ispc's semantics of synchronization points after every program sequence point, if one wanted to implement ispc where each program instance was a separate thread, then the compiler would have to emit code such that all threads synchronized after each program statement--this would be really slow. So a better implementation might try to synchronize threads only after statements where there might have been inter-program-instance communication, but due to C's pointer model, it's hard to reason about that. So there would be lots of unnecessary synchronizations, and the programmer would likely often be surprised that synchronization was happening at places where instances weren't in fact communicating...

On 3, what I actually meant is that it's more of a language for doing computations with short, programCount-sized vectors, than it is a more general parallel language. I'm not sure that's a super important point in any case.

Finally, ispc only supports a few gang sizes--basically from 4- to 16-wide (depending on the vector ISA compiled to--SSE vs. AVX, etc.). In general, increasing the gang size increases the number of live registers, so if it gets too big, cycles are wasted on register spills/refills.

I think that covers everything, but let me know if anything's still not clear...