Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Lawliet

Is it possible to rewrite the loop in this code to be:

 for (int i = programIndex; i < N; i += programCount)

and remove int idx = i + programIndex? Would this be still valid?

makingthingsfast

I had another question. Professor Kayvon mentioned in lecture that we use uniform int i = 0 in the for loop to keep it uniform (instead of starting from programIndex). I didn't entirely understand why we do this. Could I possibly have an explanation again?

kku

A uniform value means that the same value is shared across a gang. This leads to 2 possible optimizations. First is that we can save space since we only need to store the value once. Secondly, if we have a control flow that's dependent on a uniform value, the compiler can avoid generating code for dealing with divergent control flow (ie setting masks).

kku

@Lawliet

Changing the loop to for (int i = programIndex; i < N; i += programCount) would not be valid.

In each iteration, programIndex has the values [0, VECTOR_WIDTH - 1], corresponding to the SIMD lanes used by a gang.

althalus

I have a question: does this mean that the programIndex always corresponds to the SIMD width (e.g.: either 4 or 8)?

EDIT: ProgramIndex refers to the index into the instances generated while ProgramCount refers to the total number of instances. So my new question is does the SIMD length (4/8) affect the index and count in any way?

kku

SIMD width = ProgramCount

thomasts

@kku programCount may also be a multiple of the SIMD width, right? Slide 11 says "Number of instances in a gang is the SIMD width of the hardware (or a small multiple of SIMD width)." My understanding was that the gang size and programCount are equal.

althalus

@thomasts, that was my understanding too. However, there cannot be more gangs than the number of elements able to be packed in a vector for a SIMD instruction. Thus, I assumed that the programCount would be equal to or less than the SIMD width.

Elmur_Fudd

@Lawliet @kku

I don't see why it would be invalid. By removing the uniform keyword, we should be allowed to assign programIndex to i. This program would still be logically correct and output the correct answer. However, since we lose the optimization given by the uniform keyword, it may not run as fast.

kku

Suppose we write

for (int i = programIndex; i < N; i += programCount)

At the kth iteration of the loop, i for each lane would all be programCount * k.

rajul

@kku after the kth iteration the i for each instance would be programCount * k + programIndex as we start at programIndex thus in every iteration we still work on the same data.

kku

@rajul

Yeah you're right. i is not a uniform value anymore.

Sorry for the confusion.

karima

@althalus and @thomasts

These are good questions. If you read the Intel ISPC documentation it will tell you that gang sizes can be small multiples of the hardware's vector width (typically 2x).

So the next logical questions are:

1) How is this compiled into code that will run on the machine's given vector width? and

2) Why would you want to set up the ISPC compiler to create gang sizes that are multiples of the vector width of your machine?

To answer question 1: the compiler will generate back-to-back vector wide instructions that run on the same core if the gang size is a multiple of vector width. So if your gang size is 8, it will generate two 4-wide SSE instructions.

To answer question 2: Remember that gangs can share data. For example, uniform variables are shared across ISPC instances in a gang. Uniform variables are useful because they reduce storage space required since only one copy of a uniform variable is needed. More importantly, declaring a variable as uniform allows the ISPC compiler to reason about your code and make more intelligent decisions when compiling your code.

For example, if you have a conditional statement that depends on a uniform variable, the compiler can then reason that all program instances will execute that statement, allowing it to skip the overhead of managing the mask and divergent instructions. So having a larger gang size extends these benefits to more program instances than just one vector width.

All this and more can be learned by reading the ISPC documentation which I encourage you to do! It's very well documented :)

traveling_saleswoman

What will happen if N is not a multiple of programCount? Would there be N % programCount indices which do not get computed? As far as I understand, programCount is not determined by the user.

rajul

Yes programCount is not determined by user. N% programCount indices can be computed in the last iteration. This is something you need to handle in one of the Project 1 questions.

mrrobot

@karima the ISPC documentation also tells us to spawn a large number of tasks which are preferably more than the number of cores in the system. According to what @kayvon mentioned in class, I felt that spawning more threads than there are cores would lead to most threads just waiting for other threads to finish in a wait queue. He also mentioned to one student that if we get speedup by spawning more threads then we are not distributing work properly.

Why would ISPC want us to spawn a lot of tasks? From their source code, I could see that they use a worker pool approach in which they spawn worker threads based on the underlying hardware.

karima

@mrrobot I think your confusion might come from not fully understanding the distinction between threads and tasks. YOU as the programmer, tell the compiler how many tasks you wish to launch. The ISPC compiler will make an intelligent decision on how many threads to spawn based how many tasks you've told it to launch and on its knowledge of the hardware it's compiling for.

So for example if you only launched one task, obviously the compiler can only spawn at most one thread. But if you launch 2 or more tasks, the compiler can then choose to spawn more than one thread. Say you chose to launch 1000 tasks on a 2 core machine. Do you think a good compiler would spawn 1000 pthreads as well?

Tasks are an abstraction that tell the compiler what independent units of work exist in your program. Each task must be fully executed on one core as each task maps to one ISPC gang, but because tasks are independent of each other, different tasks can execute on different cores, or the same core. Whichever happens ultimately depends on how the ISPC compiler decides to compiler your code.

The number of tasks you launch does not necessarily equal the number of threads the compiler decides to generate. You are correct that ISPC uses a worker pool approach in which they spawn worker threads based on the underlying hardware.

So I have some questions for you:

1) Say you wrote a program where you launched 100 tasks. If you compile your program for a hyper-threaded machine with 4 cores, how many threads do you think a reasonable compiler will decide to spawn?

2) Obviously each thread can only work on one task at a time. What happens when one of these threads finishes a task and there are still tasks left to be done?

mrrobot

@karima I think you misunderstood my question. I'm aware of the difference between tasks and threads. I know that as per ISPC tasks are just work abstractions and threads pick up tasks from a worker pool and execute them one at a time. If a thread is done with one task it will pick up another task and start running it. To answer your questions:

1> The compiler will decide to launch between 4-8 threads as a worker pool which will keep picking up tasks from the task queue. 2> If it's done with one task, it will pick up another task from the queue and run that.

What I meant to ask was why would ISPC tell us to launch "many tasks" : To quote from documentation: "In general, one should launch many more tasks than there are processors in the system to ensure good load-balancing, but not so many that the overhead of scheduling and running tasks dominates the computation"

Why would I need to do this if I can properly divide my work into tasks such that load is evenly distributed. Shouldn't I just launch as many tasks as there are cores in the system?

rajul

@mrrobot if each task is doing exactly the same amount of work then yes if you launch just as many tasks as logical cores you would be good to go, but in practice dividing work exactly evenly is not possible and in that case creating more tasks help.

karima

@rajul and @mrrobot

Exactly right, if you can divide up your work evenly before launching tasks there's no reason to launch more tasks than logical cores. But it's often the case that you're just as well off creating more tasks and letting the compiler/system do the work distribution for you. Less work for you as the programmer to do. Plus the pre-computation necessary to divide work up evenly may be expensive in some cases.

BensonQiu

It seems there is a point of diminishing returns where there's no additional benefit to launching more tasks (the work is already evenly divided amongst your tasks), but is there a downside to launching too many tasks? I would imagine that more tasks implies more overhead - but what exactly is the overhead for a task?

momoda

@BensonQiu

Yes, there is a downside to launching too many tasks. In ISPC user guide, we know that: "In general, one should launch many more tasks than there are processors in the system to ensure good load-balancing, but not so many that the overhead of scheduling and running tasks dominates the computation." So the number of tasks is case-by-case. Maybe we should analyze the speedup to determine the number of tasks, like what we do in assignment 1.

The overhead may be the schedule time, create thread time, more memory. I am not sure.

smoothcriminal

The slide says the ISPC program instances run concurrently. Wasn't the advantage of SIMD execution units so that arithmetic that uses the same instruction stream can run in parallel, not concurrently?

karima

@smoothcriminal The slide says they run concurrently because here we're talking about a programming abstraction. The abstraction only requires that the work done by the program instances are concurrent. That is, the order in which they're executed does not matter.

Implementation wise, yes a vector width's worth of program instances are executing simultaneously.

MaxFlowMinCut

Why are denom, sign, and j marked as uniform here if they are updated by each program instance?

kayvonf

@MaxFlowMinCut. That's precisely the point! By marking them as uniform the programmer is declaring that they are not updated by each program instance. Instead there is a single copy for all program instances.

Equivalently, you can think about it as only one of the program instances doing the uniform work, and all other instances being able to read those values.