Slide 21 of 69
iZac

The execution context of each ALU must have a limited number of registers, so is there a limit on the number of local variables each instance can use?

lament

I think you mean the execution context of each thread, if the gang of program instances uses threads. To my knowledge, ALUs by themselves don't have execution contexts.

iZac

Let me rephrase: the execution context of each instance (in a gang) has only registers (I believe no caches). So is there a limit on the number of local variables each instance can use?

HingOn

@iZac I assume you are asking whether there is a limit on the number of varying (non-uniform) variables. I think a gang shares one execution context, and that context stores its own state (such as a stack pointer). So each instance can push some of its varying variables onto the stack when all registers are in use and pop them back later. According to the paper linked below, the compiler allows each instance in a gang to access memory (e.g., the stack) sequentially.
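A minimal Python sketch of the picture HingOn describes, assuming the gang runs as SIMD lanes of a single CPU thread (as ISPC's CPU targets do): a varying variable occupies one vector register that holds one value per instance, and spilling saves that whole vector to a stack shared by the gang. `GANG_WIDTH`, `GangContext`, `spill`, and `reload` are hypothetical names, and the register count is arbitrary.

```python
GANG_WIDTH = 8  # assumed programCount (e.g. an 8-lane target); hypothetical


class GangContext:
    """Hypothetical model of ONE execution context shared by a whole gang."""

    def __init__(self, num_regs=4):
        # Each vector register holds one value per program instance (lane),
        # so one varying variable costs one register, not GANG_WIDTH registers.
        self.regs = [[0] * GANG_WIDTH for _ in range(num_regs)]
        # A single stack for the gang; each entry is a whole spilled vector.
        self.stack = []

    def spill(self, r):
        """Save register r (all lanes at once) to the stack."""
        self.stack.append(list(self.regs[r]))

    def reload(self, r):
        """Restore the most recently spilled vector into register r."""
        self.regs[r] = self.stack.pop()


ctx = GangContext()
ctx.regs[0] = [lane * 10 for lane in range(GANG_WIDTH)]  # varying: one value per instance
ctx.spill(0)                     # registers exhausted: push r0 to memory
ctx.regs[0] = [-1] * GANG_WIDTH  # r0 is reused for another variable
ctx.reload(0)                    # pop it back; every instance gets its own value again
print(ctx.regs[0][3])  # 30
```

In this model the register file does not put a hard cap on local variables: once it is full, values are spilled to memory, which costs extra loads and stores rather than correctness.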

https://software.intel.com/sites/default/files/ISPC-a-SPMD-Compiler-for-Xeon-Phi-CPU.pdf

iZac

My understanding is that there is one execution context for each instance rather than one for the whole gang. Kayvon illustrates this in detail in Lecture 2, Slide 36.

toutou

I can't find where the call to the ISPC function explicitly specifies the total number of instances. So how does the compiler know how many instances to launch, and when all of them have completed?

Faust

@toutou I'm a bit confused about this as well, but here's what I believe is happening. In this program, programCount is the number of instances per outer-loop iteration. Although we don't know the exact value of programCount, what matters is that N is divisible by programCount. So the compiler will launch programCount instances. Each instance returns only once, when it has finished its work, so when the compiler has received programCount returns, it knows that all the instances are done.
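The loop structure Faust describes can be emulated sequentially in Python. This is a sketch, assuming the common ISPC idiom `for (uniform int i = 0; i < N; i += programCount)` with `int idx = i + programIndex;` and a gang size of 8. `run_gang`, `body`, and `program_count` are hypothetical names standing in for the ISPC function, its loop body, and `programCount`. Note that the caller passes only `N`; the gang size is a fixed parameter, and in this emulation `run_gang` returns once, after every instance has finished its last iteration.

```python
def run_gang(N, x, body, program_count=8):
    """Sequentially emulate one ISPC gang running an interleaved loop.

    program_count stands in for programCount (assumed 8); the inner loop
    over program_index plays the role of the SIMD lanes, which real
    hardware would execute in lockstep.
    """
    # The idiom assumes this; without it the last iteration would overrun x.
    assert N % program_count == 0, "sketch assumes N is a multiple of programCount"
    result = [None] * N
    owner = [None] * N  # which program instance computed each element
    for i in range(0, N, program_count):            # uniform: same i for all instances
        for program_index in range(program_count):  # one pass per instance (lane)
            idx = i + program_index
            result[idx] = body(x[idx])
            owner[idx] = program_index
    return result, owner  # a single return, after all instances are done


squares, owner = run_gang(16, list(range(16)), lambda v: v * v)
print(owner)  # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```

Passing `program_count=16` models a 16-wide target: each outer iteration then covers 16 elements, and N must be a multiple of 16.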

zeppelin

@toutou My understanding is that when nothing is specified explicitly, the number of instances equals the SIMD width of the hardware. Otherwise, as Kayvon mentions in the comments on the previous slide, you can pass an explicit flag to the compiler, such as --target=avx2-i32x16.