byeongcp

I have some questions to check whether I understand the ISPC abstraction correctly.

  1. My understanding is that programCount determines how many program instances are spawned by one gang, and one gang is mapped to one core. So for example, for our assignment, the programCount would be 4 since the compiler will use 4-lane SSE vector instructions (so 1 program instance would map to 1 lane), and all of these 4 program instances are in one gang which is mapped to one core. Is this correct?

  2. Is it the case that if we don't use the "task" abstraction, we will only use one core? If so, would increasing programCount be the only way to try to increase speedup?

Also, here is a link that briefly summarizes the ISPC Tasking Model: https://ispc.github.io/ispc.html#tasking-model

kayvonf

@byeongcp: Your post is correct. Increasing programCount from 4 to 8 is done by changing the ISPC output target. For example, changing from --target=sse4-i32x4 to --target=avx2-i32x8 would compile your ISPC code to 8-wide AVX instructions rather than 4-wide SSE. This could result in speedup. In fact, you can just change the flags in the Assignment 1 Makefiles to do so.

https://ispc.github.io/ispc.html#selecting-the-compilation-target
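
To make the programCount discussion concrete, here is a minimal sketch in plain C++ (not ISPC) of how one gang might cover an array when each program instance handles element `i + programIndex`. The function names `assignInstances` and `gangSteps` are hypothetical, and the interleaved assignment is an assumption about how the ISPC code is written, not something the compiler guarantees.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a single gang. For each of the n elements, record
// which program instance (0 .. programCount-1) processes it when the gang
// walks the array in steps of programCount.
std::vector<int> assignInstances(int n, int programCount) {
    assert(n >= 0 && programCount > 0);
    std::vector<int> owner(n, -1);
    // Outer loop: one iteration per gang step (all instances advance together).
    for (int i = 0; i < n; i += programCount) {
        // Inner loop: stands in for the programCount SIMD lanes.
        for (int programIndex = 0; programIndex < programCount; ++programIndex) {
            int idx = i + programIndex;
            if (idx < n) {
                owner[idx] = programIndex;
            }
        }
    }
    return owner;
}

// Number of gang steps needed to cover n elements (rounded up).
int gangSteps(int n, int programCount) {
    assert(n >= 0 && programCount > 0);
    return (n + programCount - 1) / programCount;
}
```

In this model, going from a 4-wide to an 8-wide gang halves the number of gang steps for the same array, which is where the potential speedup from changing the target comes from.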

Advanced comment: ISPC also offers a programCount=16 target, --target=avx2-i32x16, which uses two consecutive AVX instructions to execute each operation for the 16 instances in a gang. In most situations --target=avx2-i32x16 delivers better performance than --target=avx2-i32x8 even though they use the same number of SIMD execution units. The reason is a bit complex and involves superscalar execution and how a modern instruction pipeline works. Since the two instructions in each pair are independent (they work on different sets of 8 instances), it's a much friendlier instruction sequence for the processor to schedule on its resources.
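
As a rough illustration of the "two consecutive instructions" point, the sketch below models a 16-wide gang operation as two 8-wide operations in plain C++. The names `add8` and `gangAdd16` are hypothetical, and the inner loop only stands in for a single 8-lane AVX instruction; this is not the code ISPC actually emits.

```cpp
#include <array>
#include <cassert>

constexpr int kHardwareLanes = 8;  // 32-bit lanes in one 256-bit AVX register
constexpr int kGangWidth = 16;     // programCount for an x16 target

using GangValues = std::array<float, kGangWidth>;

// Stand-in for one 8-wide vector add covering lanes [offset, offset + 8).
void add8(const GangValues& a, const GangValues& b, GangValues& out, int offset) {
    assert(offset % kHardwareLanes == 0);
    assert(offset + kHardwareLanes <= kGangWidth);
    for (int lane = 0; lane < kHardwareLanes; ++lane) {
        out[offset + lane] = a[offset + lane] + b[offset + lane];
    }
}

// One gang-wide add for 16 program instances, expressed as two 8-wide adds.
// The two calls touch disjoint lanes, so neither depends on the other's result.
GangValues gangAdd16(const GangValues& a, const GangValues& b) {
    GangValues out{};
    add8(a, b, out, 0);               // instances 0..7
    add8(a, b, out, kHardwareLanes);  // instances 8..15
    return out;
}
```

Because the two halves share no data, a processor is free to overlap them, which is the independence the comment above refers to.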

Question: Increasing the programCount further on an AVX machine wouldn't yield much benefit. Why?

Compiling Assignment 1 program 4 with your base-case input using --target=avx2-i32x16 and creating tasks to fill all cores provides quite a substantial speedup.

VP7

This may be slightly off-topic, but I feel it is still within the spirit of this lecture's content.

Is it possible to combine OpenMP and ISPC code? If yes, how different are the following 2 scenarios?

  1. Using #pragma omp parallel num_threads(4) and calling an ISPC function with no tasks on a quad-core x86 Unix target.
  2. Calling an ISPC function with 4 tasks on a quad-core x86 Unix target.

kayvonf

Yes, (1) and (2) would be very similar. An even more interesting question: what if you replaced 4 in your scenarios above with 10000?
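
Scenario (1) could look roughly like the C++ sketch below. `scaleChunk` is a hypothetical stand-in for an exported ISPC function that launches no tasks (here it is just a serial loop), and `scaleWithOpenMP` is an assumed wrapper that hands each OpenMP thread one contiguous chunk. If the file is compiled without OpenMP support, the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an ISPC function with no tasks inside: on a real
// build this would be the exported ISPC entry point, running one gang on
// whichever core calls it.
void scaleChunk(float* data, int count, float factor) {
    for (int i = 0; i < count; ++i) {
        data[i] *= factor;
    }
}

// Split the array into numThreads contiguous chunks and let each OpenMP
// thread call the (stand-in) ISPC function on its own chunk.
void scaleWithOpenMP(std::vector<float>& data, float factor, int numThreads) {
    assert(numThreads > 0);
    const int n = static_cast<int>(data.size());
    const int chunk = (n + numThreads - 1) / numThreads;  // rounded up
    #pragma omp parallel for num_threads(numThreads)
    for (int t = 0; t < numThreads; ++t) {
        const int begin = t * chunk;
        const int end = (begin + chunk < n) ? begin + chunk : n;
        if (begin < end) {
            scaleChunk(data.data() + begin, end - begin, factor);
        }
    }
}
```

In scenario (2), the chunking and the mapping of chunks to cores would instead be expressed inside the ISPC code with tasks, and the task system would decide which thread runs each one.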

VP7

Wouldn't that be the answer to the extra credit question?

kayvonf

Yep.

dsaksena

Today, we were finally given the difference between a task and a thread. From my understanding, launching tasks is like a promise: "here is a set of tasks, each of which can be performed by a gang of instances, and which can run in parallel."

foreach was a promise that "here is some work that can be performed by different instances in a gang in parallel."

Nowhere did I contact the OS; I have simply made promises to the ISPC compiler.

ISPC apparently gives us the ability to make promises to the compiler, and the compiler takes care of the implementation.
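
One way to picture a task system honoring such a promise is a fixed pool of worker threads that pull task indices from a shared counter. The C++ sketch below is only an illustration under that assumption, with a hypothetical `runTasks` helper and a trivial task body; it is not ISPC's actual task runtime.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: run numTasks independent tasks on numWorkers threads.
// Each worker repeatedly claims the next unclaimed task index until none are
// left. Returns how many tasks each worker ended up running.
std::vector<int> runTasks(int numTasks, int numWorkers, std::vector<int>& out) {
    assert(numWorkers > 0);
    assert(static_cast<int>(out.size()) == numTasks);
    std::atomic<int> nextTask{0};
    std::vector<int> tasksRun(numWorkers, 0);
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                const int t = nextTask.fetch_add(1);  // claim one task index
                if (t >= numTasks) {
                    break;
                }
                out[t] = t * t;  // stand-in for the body of task t
                ++tasksRun[w];   // each worker only writes its own slot
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return tasksRun;
}
```

With this structure, the number of tasks and the number of threads are separate knobs: for example, `runTasks(10000, 4, out)` describes 10000 independent pieces of work while creating only 4 threads.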