Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Previous | Next --- Slide 2 of 48

jezimmer

Not gonna lie, I still don't really understand this code. Part of it I think stems from the fact that I can't see it's value. Is there any use at all to using this over forall? Is there a simple example where this style would be preferred over just specifying a more typical "data-parallel" for loop?

marjorie

I feel like I understand the code in broad terms, but I don't understand why i, j and sign are uniform.

As I read the code, when sinx is called by the C program, the thread of control will pass to each of a gang of workers; let's say there are X workers in the gang, where X = programCount. The outer loop divides the array into chunks of X elements each, so that the first worker will take the first element in each chunk, the second will take the second element, and so on.

I'm really confused why i is uniform, though. Wouldn't we want each worker to have its own copy of i so it iterates through the array at its own rate? If i is uniform, doesn't that mean that when the first worker to finish its first loop iterates i, the value of i changes for every worker? And then when the next worker finishes, wouldn't it then increment i again? That's how it would work if a uniform value were like a global value shared by pthreads, at least. However, what must really happen is that i only increments once ALL workers finish the loop, instead of incrementing each time ANY worker finishes the loop. That makes sense when I think about the implementation, where all these computations are happening simultaneously in a vector, but it's not at all obvious to me in the abstraction.

But anyway, within the outer loop, each worker gets the appropriate value from the array and creates its numerator, and uses the universal denominator and sign. Then, in the inner loop, it adds together all the terms in the approximation, flipping the sign on each pass through the inner loop. But again, I'm confused why this sign is uniform, since the value needs to be consistent between passes within each worker, not across workers. What would happen if sign weren't uniform? Does that mean the sign flipping would all happen sequentially in SIMD implementation, instead of being a sync-up point between lanes?

At any rate, after all the inner loops, each worker it saves the sum of all those terms in the result array.

As for doing this instead of forall, I can't imagine any benefit, unless there's some circumstance in which you think the ISPC compiler is going to choose a non-optimal way of breaking down the work. I guess I don't know how clever the compiler is.

srw

Addressing the issue of why the program's use 'uniform' in loops, especially inner loops:

I thought that uniform was like a global variable, where changes in one instance affect it's value in all other instances. But I don't think this is the case. I looked at the documentation, especially here: https://ispc.github.io/ispc.html#uniform-control-flow. It seems that variables are uniform if the logic affecting them is the same in every instance. For instance, no loop alters them using an instance specific value, such "programIndex", or a random value. The values of i and j differ throughout the program, but they differ in the same way across all instances. This DOESN'T mean that their values differ at the same TIME across all instances.

This is my understanding of uniform as things stand. If I am mistaken, someone please correct me.

andrewwuan

When we say none iterations of the loop are parallelized by ISPC, is it because that we explicitly use programCount to distribute work among all workers, so ISPC won't help us parallelize? If this is the case, does ISPC only parallelize stuff for us when we use keywords like "forall"?

srw

@andrewwuan That seems to be the case. The for vs foreach comparison is explicitly considered on Slide 14 For more information about how foreach works, there is a short but thorough explanation in the Intel docs here: https://ispc.github.io/ispc.html#parallel-iteration-statements-foreach-and-foreach-tiled. This also includes details about other loops in ISPC, and is a useful read.

gryffolyon

Still not clear on why and how the uniform works for the iterators i and j as Marjorie pointed.

Also when we use uniform, does ispc bring about synchronized access to these variables between all the program instances?

andymochi

@marjorie & @gryffolyon Feel free to poke some holes in my re-attempt at explanation:

First think back to how/why ISPC was designed. The goal is to give the programmer a language to specify what kind of work can be vectorized (SIMD instructions). Somehow - from ISPC code, and ISPC compiler is able to organize the work and code execution in a 'smart' way, which you'll be able to see if you examine the assembly.

The importance of the 'keywords' in ISPC like 'uniform' and 'foreach' are to explicitly cue the compiler to see some code and say "Hey, we can probably optimise our code here!"

On the 'uniform' keyword, the intel guide explains a few of the optimisations they make. The biggest one seems to be it's usefulness on conditional checks only involving 'uniform' data. The compiler can optimise this into a single check that can be shared across program instances.

On the 'foreach' stuff above, the crucial difference is that no 'cue' for the compiler is explicit in the code. That is why we say it is not 'parallelized' by ISPC. My understanding is that ISPC has no responsibility to try to parallelize either of these loops. That being said, I'm pretty sure the ISPC compiler is smart enough to recognize this case, and compile to essentially the same code (except for the code that accounts for N/programCount). This is where abstraction differs from implementation. If you use the foreach, you can expect that compiler will try to parallelize this, but you can't expect this in the above code (even though it might be smart enough to do it in this case).

Afterthought: Remember though, that these keywords are 'cues' for implementation optimisations. The compiler could or could not make optimisations using these keywords as long as the implementation matches the ISPC abstraction. While reading the section on 'uniform' data, I saw this tidbit "In this case, a sufficiently smart compiler could determine that dx and dy have the same value for all program instances and thus generate more optimized code from the start, though this optimization isn't yet implemented in ispc."

It's easy to forget that ISPC implementations are still under development, too! As long as we include all of the 'cues' that ispc asks for in it's abstraction, we can expect our code to be more and more optimized as the ISPC implementation gets better.

P.S. I'm still a noob at markup, so sorry for all the "'"s everywhere. ROFL those are my attempts at emphasis.