Slide 34 of 48
billdqu

Answer to why we use reduceAdd here: it safely updates the shared value diff, which is read and written by all parallel workers.

kayvonf

Question: I'll claim that in this fictitious data-parallel programming language, the compiler is responsible for assigning iterations of the for_all loop to processors. Agree or disagree?

bjay

@kayvonf I'm going to agree here. Since the programmer doesn't use anything like ISPC's programIndex to explicitly divide the work among different instances, that leads me to believe that something under the hood is taking care of the assignment.

maddy

@kayvonf I think I am going to disagree here. I think it is the system that handles both the assignment and orchestration.

jmc

I agree with @bjay that the programmer definitely isn't doing anything to specify how to assign iterations to processors here, and can't, because they don't have anything like the programCount variable we have in ISPC. I think @maddy's point is that even if the programmer isn't doing it, the assignment could be done by the system, rather than the compiler. I think this would mean the compiler translating the for_all into something like a system call to spawn some (system-determined number of) threads. I don't quite see why we would do this, so the compiler doing the assignment makes more sense to me.

maddy

To be honest, I am not sure whether the system or the compiler does the assignment. But I feel that since the compiler has only static information (maybe just the number of cores and their relative placement), it might sometimes make more sense to use a more dynamic kind of assignment. Assignment can potentially depend on the relative workloads, on when a core becomes free, etc. This information is not available at compile time.

asd

Agree. The system/compiler is in charge of assigning the iterations. I found this passage that I felt supported my answer: "OpenMP is an API that implements a multi-threaded, shared memory form of parallelism. It uses a set of compiler directives (statements that you add to your C code) that are incorporated at compile-time to generate a multi-threaded version of your code. You can think of Pthreads (above) as doing multi-threaded programming "by hand", and OpenMP as a slightly more automated, higher-level API to make your program multithreaded. OpenMP takes care of many of the low-level details that you would normally have to implement yourself, if you were using Pthreads from the ground up."

Reference: http://gribblelab.org/CBootcamp/A2_Parallel_Programming_in_C.html

pdp

Agree. The programmer is just specifying that these parts of the code are independent and can be performed in parallel. The compiler will decide whether, and how many, vector-wide instructions to launch to perform the computations in parallel.

nemo

Just for correctness' sake: shouldn't TOLERANCE be divided by 2 here, since we are only accumulating the differences for the red cells, rather than for the entire grid as we did in the sequential program?

yulunt

There is a trade-off between assignment done at compile time and assignment done at runtime. In this case, assignment by the compiler can be efficient because each task executes the same number of instructions. However, if each task contains a loop whose trip count depends on the input data, static assignment is inefficient because the machine cannot make full use of its resources when some tasks finish earlier than others.

hzxa21

I think the assignment depends on which programming model we are using. For example, if we are using ISPC tasks, there are actually two assignments here. The first is to break up the work of updating all the red cells into tasks. Then, within each task, the work is broken up across program instances. The first assignment is done by the compiler and the runtime system, since ISPC uses a task-queue implementation: whenever a core finishes a task, it immediately picks up another task to run. The second assignment is done by the compiler, which maps each program instance's work to a vector lane.