xiaoguaz

Here we were told that the second decomposition is better because it is more flexible and portable. But I am curious about the assignment algorithm: are there rules for how "foreach" assigns pieces of work? Will the system take advantage of spatial locality? If different pieces have different run times, will the system try to avoid load imbalance?

jaguar

foreach will attempt to vectorize the loop; the vectorization is performed by the ISPC compiler according to the specific target hardware.

althalus

I am assuming that there will be some cases in which static assignment is better. Can anyone give an example? I feel that this might apply to the last part of Assignment 1...

cyl

A good article on how ISPC is implemented: ispc

maxdecmeridius

Here, Kayvon talks about how each iteration of the loop is a program instance, or a worker. However, I thought that program instances were more related to tasks, and that these two implementations deal more with vectorizing the code (for SIMD execution). Can anyone elaborate?

rds

@maxdecmeridius, each iteration of the loop is 'assigned' to a program instance. Each program instance carries out certain loop iterations either determined by the programmer (as in the example on the left) or by the compiler (as in the example on the right). The number of program instances is fixed (specified as a compilation flag, if I am not mistaken). The ISPC compiler maps this code to vector instructions and each program instance maps to one vector lane. I am not sure what you referred to as workers exactly, but, an entire 'gang' of program instances is carried out on a single execution context, which can be thought of as a single thread.
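
To make that concrete, here is a rough ISPC sketch of the two styles (not the exact code on the slide; the function and variable names are made up):

```
// "Left" style: the programmer assigns iterations to program instances
// explicitly, here interleaved using programIndex / programCount.
export void scale_interleaved(uniform int N, uniform float a,
                              uniform float x[], uniform float y[])
{
    for (uniform int base = 0; base < N; base += programCount)
    {
        int i = base + programIndex;   // each instance takes one element
        if (i < N)
            y[i] = a * x[i];
    }
}

// "Right" style: foreach declares the iterations independent and leaves
// the iteration-to-instance assignment to the ISPC compiler.
export void scale_foreach(uniform int N, uniform float a,
                          uniform float x[], uniform float y[])
{
    foreach (i = 0 ... N)
    {
        y[i] = a * x[i];
    }
}
```

In both versions the whole gang of programCount instances executes the same function body, and each instance maps to one vector lane as described above.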

yimmyz

This slide got me kind of confused about the implementation of the left-hand-side approach.

I know that in the right-hand-side case, SIMD parallelization (vector instructions) is used across different $i$'s; what about the left-hand side? Are different program instances also run with vector instructions?

Also, is there a way to map "work" to multiple cores?

kayvonf

To everyone: Each iteration of the loop is definitely not a program instance in the code above. Program instances are created by ISPC when a gang is launched (at the time of invoking the ISPC function from the calling C code). Each program instance runs the ISPC program(s) provided above -- this is SPMD ("single-program, multiple data" execution... where it would be even more accurate to call it "single program, multiple instance"). There is no parallelism at all expressed in the code above, but the code above is run in parallel by all the program instances. Take a look at the "quiz" we started this lecture with for further discussion on these differences.

This is a very, very important point.

PIC

There are loops that can be parallelized (through vectorization) but whose iterations are not completely independent, such as reductions. Will ISPC parallelize these? To what degree will it resolve cross-iteration dependencies?

Allerrors

@PIC I think ISPC will parallelize any loop that the programmer indicates (by using the foreach keyword). If the loop has cross-iteration dependencies, the result could be wrong. It is the programmer's responsibility to manage those dependencies.
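
For reductions specifically, ISPC's standard library does provide cross-instance operations such as reduce_add(), so that particular kind of cross-iteration dependence can be expressed correctly and still vectorized. A minimal sketch (made-up function name):

```
// Sum an array: each program instance accumulates a private (varying)
// partial sum over the iterations it is assigned, and reduce_add()
// combines the partial sums across the gang at the end.
export uniform float array_sum(uniform int N, uniform float x[])
{
    float partial = 0.0f;           // varying: one partial sum per instance
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }
    return reduce_add(partial);     // cross-instance reduction
}
```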

lol

@kayvonf Can you give an example of parallelism expressed in code, to make the distinction more clear?

kayvonf

@lol. SIMD vector intrinsics specify parallel operations.

pthread_create() spawns a parallel thread. (Technically that's concurrency, but if you have free cores, that thread will certainly run in parallel.)

The point is that the foreach loop above is describing independent work. Not concurrent or parallel work.

lol

So the launch construct also describes independent work, but not parallel, due to the abstraction?

Is it that SIMD intrinsics and pthread_create() are not abstractions but actual implementations, whereas all of ISPC could actually be sequential?

randomthread

@lol I think you are right that the launch construct describes independent work, but we should consider the purpose of such statements. launch is similar to pthread_create(): with pthread_create() we cannot guarantee parallelism, but if a core is available we will get parallelism. launch doesn't necessarily describe parallel work, but we use it with the intent of getting parallelism.
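
To tie this back to @yimmyz's question about mapping work to multiple cores: launch creates ISPC tasks, and the task system (linked in on the C side, as in Assignment 1) may schedule those tasks on different cores. A rough sketch with made-up names and an assumed task count:

```
// Each task processes one contiguous chunk of the array with a gang of
// program instances; taskIndex identifies the task.
task void scale_chunk(uniform int N, uniform int span, uniform float a,
                      uniform float x[], uniform float y[])
{
    uniform int start = taskIndex * span;
    uniform int end = min(start + span, N);
    foreach (i = start ... end)
    {
        y[i] = a * x[i];
    }
}

export void scale_tasks(uniform int N, uniform float a,
                        uniform float x[], uniform float y[])
{
    uniform int nTasks = 16;                      // assumed task count
    uniform int span = (N + nTasks - 1) / nTasks;
    launch[nTasks] scale_chunk(N, span, a, x, y);
    sync;                                         // wait for all tasks
}
```

Like pthread_create(), launch only describes independent tasks; whether they actually run in parallel depends on the task system and the cores available.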