Slide 3 of 48
khans

The program itself is run serially. Different instances of the code are run in parallel, but the program with all its loop iterations is serial.

bochet

One question, if I change the outer loop to foreach, is it parallelized by ISPC?

asd

@bochet: Yes, indeed it should be parallelized by ISPC.

kayvonf

@bochet, asd: I almost entirely agree with your conclusion here, but things are a little subtle. Consider this ISPC program. Can you describe to me what it does, and where the parallelism is?

export void foo(uniform int N, uniform float in[], uniform float out[]) {

    float x = in[programIndex];

    foreach (int i = 0 ... N) {
        x += in[i];
    }

    out[programIndex] = x;
}

amg

So here multiple program instances will execute foo in parallel. We have some non-determinism, as it is up to the ispc compiler to determine programCount, but I think we will end up with the sum of all of the values in in[] placed in out[] at indexes 0 to programCount-1.

asd

As per the ISPC documentation, and I quote- 'in each loop iteration, the running program instances effectively collude amongst themselves using programIndex to determine which elements to work on in a way that ensures that all of the data elements will be processed.'

So, as @amg said, multiple instances of foo will be executed in parallel such that we will have the total sum of all elements of 'in' in 'out'.

Hope this is right?

kayvonf

@amg, @asd: Close. It's true that out[i] will be written to for all values of i between 0 and programCount. Let's just say programCount is eight for now.

What values are in the eight elements of out[] that are written to? You seem to suggest that it's the sum of all elements in in[]. Are you sure? Is the answer even defined by ISPC?

sampathchanda

It is definitely not the sum of all elements in 'in[]'. It could just be the sum of two floats: one from the initialization and the other from inside the for loop. The float from the for loop will be from the iteration whose program instance was the last to be executed. That is, the answer is non-deterministic.

However, I have a question here that kind of confused me. The foreach loop is used to spawn the independent iterations automatically onto different program instances. What happens when there is no foreach loop? Will programIndex be constant or still be varying?

For example, what would be the output of the following ISPC function when called:

export void sample_func() {
    print("programIndex = %\n", programIndex);
}

kayvonf

@sampathchanda. Write the code, run it, and tell us!

Also, your confusion stems from this statement: "foreach loop is used to spawn the independent iterations automatically onto different program instances". Take a look at @asd's comment above that quotes the precise semantics of foreach. Note there's no statement there about foreach spawning anything at all. The instances exist from the start of the ISPC function call until the termination of the function call. foreach does not create instances.

sampathchanda

Thanks. I totally overlooked the fact that the program instances map to SIMD lanes, which execute the same instruction on multiple data, with nothing to do with spawning. I expect the program to print the programIndex values in a non-deterministic order.

lfragago

So, one can view this as one instance's execution within a gang. Each instance runs serially, and the fact that we have multiple instances (the gang) running at once is what makes the implementation parallel.

asd

@kayvonf: Will this code attempt to spawn instances wherein each instance will add the sum of the array to the element at the index location equal to the programIndex? But the problem here would be that we would get a non-deterministic result, since each instance will attempt to write to the shared variable x?

Hence, even in the code given in the slide, if we change the outer loop to foreach, we will mostly not get the correct values due to the final answer going into the shared variable "value" across instances?

kayvonf

I want this to be absolutely clear. The code on this slide does not spawn program instances! The code on this slide does not spawn program instances! The code on this slide does not spawn program instances! ;-) This is the code executed serially by each instance. It is executed by all instances in an ISPC gang.

The gang of instances is created when the caller calls the sinx function. When all instances complete the body of this function, control returns to the caller.

anonymous

@kayvonf Let me try to explain the foo function.

1. When foo is called, a gang of programCount instances is created.
2. float x = in[programIndex]; // x is given an initial value according to the instance ID (programIndex).
3. foreach (int i = 0 ... N) { x += in[i]; } // foreach is just a mapping between the in data and the processing instances; say the mapping is interleaved and the gang size is 4.
4. out[programIndex] = x; // the result computed by each instance is stored in out.

I did the experiment on GHC and verified my conjecture. N is set to 20, and in is initialized to 0 ... 19. I configured the ispc compiler with "--target=sse2-i32x4", so the gang size is 4. The final result in out is 40.00 (0+0+4+8+12+16), 46.00 (1+1+5+9+13+17), 52.00 (2+2+6+10+14+18), 58.00 (3+3+7+11+15+19), which verifies that the mapping of "foreach" for sse2 on GHC is interleaved.

mak

What's the purpose of the "uniform" qualifier in the innermost "for" loop? How does it help the compiler with optimization?

And most importantly, why doesn't it create a race condition similar to the "uniform" sum variable in the example of lec3, slide 18?

Does the "uniform" qualifier have more than one meaning, based on context?

vadtani

@mak

why doesn't it create race condition similar to "uniform" sum variable in example of lec3, slide 18.

In the lec3, slide 18 example there is only one copy of sum for all program instances, and, if you notice, we introduced parallelism by using foreach as well. So multiple instances can load the same value of sum, update it, and write it back. Whereas in this case, you can see from the slide that the loops are not parallelized.

kayvonf

@mak: Both loops (the inner and outer) in this example have uniform control flow. In other words, the loop counters, loop termination conditions, etc. are all uniform expressions. By definition this means that this logic is exactly the same for all program instances, and need only be carried out once for the entire gang (not once per program instance.)

Knowledge that logic (in particular, control-flow logic) is uniform can lead to a lot of efficiencies when generating code for a modern CPU, since a modern CPU supports both SIMD and non-SIMD scalar instructions. For example, all the logic for managing the loops can be carried out with scalar instructions instead of SIMD ones. Not only do most CPUs have greater instruction throughput for scalar operations than vector ones, the code might also get the benefit of superscalar execution, since the scalar ops can issue simultaneously with the SIMD ones. Second, since it is statically known that all program instances execute the same number of iterations of the loop (even if the actual number of iterations is determined at run time based on the value of its terms), there is no need for the compiler to emit code for managing per-instance masks, or for dealing with divergence in instance control flow, since conditions of divergence cannot be created by these loops.

If you think about your solution to Assignment 1, Program 2, you can imagine how your code might be different if you knew certain logic was uniform. That's exactly why the creators of ISPC provide language abstractions (in this case a type modifier) for the programmer to explicitly denote uniform values and expressions.

shiyqw

Slide 12 of the previous lecture also does not use the foreach keyword. Is that program parallelized?

albusshin

The program specifically uses "for" loops instead of "foreach", telling the compiler "this is the implementation that I want," rather than an abstraction. The program executes without any iterations being parallelized across the gang.