Slide 28 of 69
rohitban

I was a little confused about the specifics of the ISPC implementation: does the gang of programCount instances start up whenever control is transferred to the ISPC function, or only when the function executes a loop? For example, if we wrote an ISPC function with no loops (no for/foreach or tasks) and programCount = 4, would the compiler still create a gang of 4 program instances and leave 3 of them idle, or would it simply start a single program instance?

aznshodan

From what I understand, the gang of instances starts up when control is transferred to the ISPC function from the C++ code, and I believe it is up to the compiler to decide how many instances to create. For example, when this code uses a foreach loop, it will generate a gang of instances. The number of instances may or may not be 4, but let's say it generates 4. The loop is then implemented with SIMD instructions, using either an interleaved assignment of elements to instances (Slide 21) or a blocked assignment of elements to instances (Slide 24). Since interleaving is more efficient, the compiler will choose the interleaved assignment.
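To make the two assignments mentioned above concrete, here is a minimal C++ sketch of which element indices each instance would handle under each scheme. The helper names interleavedIndices and blockedIndices are made up for illustration and are not part of ISPC; this only models the index mapping, not how ISPC generates code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: element indices handled by one instance when
// iterations are dealt out round-robin (interleaved assignment).
std::vector<int> interleavedIndices(int instance, int programCount, int n) {
    assert(programCount > 0 && instance >= 0 && instance < programCount);
    std::vector<int> out;
    for (int i = instance; i < n; i += programCount) out.push_back(i);
    return out;
}

// Hypothetical helper: element indices handled by one instance when each
// instance gets one contiguous chunk (blocked assignment).
std::vector<int> blockedIndices(int instance, int programCount, int n) {
    assert(programCount > 0 && instance >= 0 && instance < programCount);
    const int chunk = (n + programCount - 1) / programCount;  // ceil(n / programCount)
    const int begin = std::min(n, instance * chunk);
    const int end = std::min(n, begin + chunk);
    std::vector<int> out;
    for (int i = begin; i < end; ++i) out.push_back(i);
    return out;
}
```

With 8 elements and 4 instances, the first step of the gang touches indices 0, 1, 2, 3 under the interleaved mapping (consecutive in memory) and indices 0, 2, 4, 6 under the blocked mapping, which is one way to see why interleaving tends to suit SIMD loads better.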

kayvonf

The gang of instances is created at the time of the ISPC function call. The number of instances is set at compile time (e.g., 4 on assignment 1). The call returns when all instances in the gang are complete. foreach does not create instances; it merely states that "all instances in the gang" should cooperate to perform all the iterations of the loop.

The implementation of foreach (e.g., how iterations are assigned to program instances) is not defined by ISPC. ISPC reserves the right to perform that assignment in whatever way it wishes.
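One way to picture this model is a scalar C++ emulation, sketched below under some loud assumptions: kProgramCount and scaleGang are hypothetical names, the instances are run one after another rather than in SIMD lanes, and the interleaved split inside the loop is just one assignment an implementation might pick. The sketch only shows that the gang exists for the whole call and that the foreach iterations are shared so each one runs exactly once.

```cpp
#include <cassert>
#include <vector>

constexpr int kProgramCount = 4;  // assumption: gang size fixed at compile time

// Emulates a call to an ISPC function whose body is
//   foreach (i = 0 ... n) x[i] *= s;
// The function returns only after every logical instance has finished.
void scaleGang(std::vector<float>& x, float s) {
    const int n = static_cast<int>(x.size());
    std::vector<int> touched(n, 0);  // how many times each iteration ran

    // The gang: one logical instance per programIndex, alive for the whole call.
    for (int programIndex = 0; programIndex < kProgramCount; ++programIndex) {
        // Emulated foreach: this instance takes its share of the iterations.
        // Interleaving is an assumption here, not something the language promises.
        for (int i = programIndex; i < n; i += kProgramCount) {
            x[i] *= s;
            ++touched[i];
        }
    }

    // The instances cooperated: every iteration was performed exactly once.
    for (int count : touched) assert(count == 1);
}
```

Swapping the inner loop for a blocked split would leave the result unchanged, which matches the point that the assignment is an implementation choice rather than part of the program's meaning.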

iZac

Does ISPC document the scheme or algorithm it uses to assign loop iterations to instances? That is, does it give any performance guarantee relative to the most intuitive assignment?

EDIT: Found this helpful. Mentioned by Kayvon in previous slides.

wtwood

@rohitban's question: "If we were to specify an ISPC function with no loops (no for/foreach or tasks) and a programCount = 4, would the compiler still create a gang of 4 program instances and have 3 of them idle or would it simply start a single program instance?"

Strictly speaking, neither. In the case you described, the ISPC function would spawn a full gang of instances, and each instance would run the function concurrently. No instance would idle. So for example, if you wrote the ISPC function:

export uniform int mult(uniform int a, uniform int b) {
    // Exported functions must return a uniform type.
    uniform int c = a * b;
    return c;
}

When this is called, each of the four program instances runs the code. Each one computes a * b, and the call does not return until all four instances have finished this calculation.

kayvonf

@wtwood. Your post's description of the meaning of an ISPC program is correct. However: Since the expression you have written is uniform, if it is more efficient to do so on the target architecture, the ISPC implementation may utilize this static knowledge about the uniform nature of the computation to implement the logic using scalar operations on scalar registers (rather than vector operations on vector registers). This does not change the meaning of the program, but it is a slightly different implementation than the logical interpretation you provided entails.
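The difference between the logical interpretation and a possible scalar implementation can be sketched in plain C++. This is an emulation under stated assumptions, not ISPC output: multLogical, multScalar, and kProgramCount are hypothetical names, and the instances are run sequentially instead of in vector lanes.

```cpp
#include <cassert>

constexpr int kProgramCount = 4;  // assumption: gang size fixed at compile time

// Logical interpretation: every instance in the gang evaluates a * b.
int multLogical(int a, int b) {
    int c = 0;
    for (int programIndex = 0; programIndex < kProgramCount; ++programIndex) {
        const int mine = a * b;
        // Because the inputs are uniform, all instances must agree on the value.
        assert(programIndex == 0 || mine == c);
        c = mine;
    }
    return c;
}

// One implementation the compiler is free to choose for uniform code:
// compute the value once with scalar operations.
int multScalar(int a, int b) {
    return a * b;
}
```

Both functions return the same value for any inputs, so a caller cannot tell which strategy was used; only the amount of work done differs.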

rbandlam

I like the idea of leaving the assignment of loop iterations to program instances up to ISPC. Even though the current ISPC implementation uses a static interleaved assignment, a future implementation could instead base the assignment on the amount of work each iteration requires. For example, in the sqrt example from assignment 1, we observed that some input values need many iterations and others need very few. If values that need roughly the same amount of work were put in the same gang, vector utilization could improve tremendously. The tricky part is identifying, from the input alone, which values need the same amount of work!
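The utilization argument can be checked with a small model, sketched here in C++ under simplifying assumptions: each group of `width` consecutive elements is processed together (as with an interleaved assignment), the whole group keeps running until its slowest element finishes, and the per-element iteration counts are known up front. The function name vectorUtilization is made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of lane-steps that do useful work when consecutive groups of
// `width` elements run in lockstep until the slowest element in the group
// is done. iters[i] is the number of loop iterations element i needs.
double vectorUtilization(const std::vector<int>& iters, int width) {
    assert(width > 0);
    const std::size_t w = static_cast<std::size_t>(width);
    long useful = 0;  // lane-steps spent on real work
    long issued = 0;  // lane-steps the hardware actually spends
    for (std::size_t base = 0; base < iters.size(); base += w) {
        const std::size_t end = std::min(iters.size(), base + w);
        int longest = 0;
        for (std::size_t i = base; i < end; ++i) {
            useful += iters[i];
            longest = std::max(longest, iters[i]);
        }
        issued += static_cast<long>(longest) * width;  // all lanes wait for the slowest
    }
    return issued == 0 ? 1.0 : static_cast<double>(useful) / static_cast<double>(issued);
}
```

For the iteration counts 1, 8, 1, 8, 1, 8, 1, 8 with width 4, this model gives a utilization of 0.5625; sorting the same counts so that similar work sits together gives 1.0. The model ignores the cost of finding and reordering such values, which is exactly the tricky part.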