Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

ragnarok451

Why is j also uniform? Since j computes the terms of the sin expansion and forms a distinct for loop in each program instance, I would think that each program instance would need its own version of j.

It seems like the program would work with it as a uniform variable, but potentially it would cause some slowdown, since every program instance would need to be at the same iteration of the for loop.

mingf

j counts the terms of the Taylor expansion as they are computed, so it takes the same value in every program instance. All program instances proceed at the same pace when executing instructions. This is exactly what SIMD does.

Jing

But each program instance should maintain its own j, right? Otherwise we would waste a lot of time on locking to keep the results from being wrong.

jazzbass

@jing This is my understanding; I hope it helps:

Uniform variables are shared across an entire gang. If we don't use uniform, a variable with a distinct storage location for each program instance in the gang will be created.

In this case we know that each program instance in the gang will execute the same number of iterations of the for loop, so we declare j as uniform. If we didn't declare it as uniform, we would pay the overhead of handling possible control flow divergence and masks (similar to what we did in assignment 1 problem 2), making our program less efficient.

kayvonf

There are no locks in this program.

j is uniform because it is statically known that all instances carry out the same number of iterations. The program could be correctly written with j as a non-uniform, per-instance variable, but that would be less efficient, since the loop-control logic only needs to be carried out once per gang, not once per program instance.

Note: Technically, I believe the program would not compile if j was just an int rather than a uniform int because you cannot directly compare a uniform int (terms) with an int (j). You'd have to copy terms into a per-instance int first.
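
To make the "once per gang, not once per instance" point concrete, here is a rough sketch in plain Python (not ISPC) that models one gang computing sin(x) with a Taylor series. Everything in it is an assumption made for illustration: the function names, GANG_SIZE = 8, and the control_ops counter, which simply counts how many times the loop condition is evaluated. It is a toy model of the cost difference, not what the ISPC compiler actually emits.

```python
import math

GANG_SIZE = 8  # assumed gang width, chosen only for this illustration


def sin_uniform_j(xs, terms):
    """One gang, with a single shared (uniform) loop counter j."""
    values = list(xs)                     # per-instance running sum
    numer = [x * x * x for x in xs]       # per-instance power of x
    denom = 6                             # uniform: same in every instance
    sign = -1                             # uniform
    control_ops = 0
    j = 1                                 # ONE counter for the whole gang
    while True:
        control_ops += 1                  # one loop test per gang
        if j > terms:
            break
        for lane in range(len(xs)):       # stands in for one vector instruction
            values[lane] += sign * numer[lane] / denom
            numer[lane] *= xs[lane] * xs[lane]
        denom *= (2 * j + 2) * (2 * j + 3)
        sign = -sign
        j += 1
    return values, control_ops


def sin_varying_j(xs, terms):
    """Same computation, but every instance keeps its own j, denom and sign."""
    n = len(xs)
    values = list(xs)
    numer = [x * x * x for x in xs]
    denom = [6] * n
    sign = [-1] * n
    j = [1] * n                           # one counter PER instance
    control_ops = 0
    while True:
        mask = [j[lane] <= terms for lane in range(n)]
        control_ops += n                  # one loop test per instance
        if not any(mask):
            break
        for lane in range(n):
            if mask[lane]:                # masked execution
                values[lane] += sign[lane] * numer[lane] / denom[lane]
                numer[lane] *= xs[lane] * xs[lane]
                denom[lane] *= (2 * j[lane] + 2) * (2 * j[lane] + 3)
                sign[lane] = -sign[lane]
                j[lane] += 1
    return values, control_ops


if __name__ == "__main__":
    xs = [0.1 * k for k in range(GANG_SIZE)]
    u_vals, u_ops = sin_uniform_j(xs, 5)
    v_vals, v_ops = sin_varying_j(xs, 5)
    print("results match:", u_vals == v_vals)
    print("max error vs math.sin:", max(abs(a - math.sin(x)) for a, x in zip(u_vals, xs)))
    print("loop tests (uniform j):", u_ops)
    print("loop tests (varying j):", v_ops)
```

Both versions produce the same values; in this model the per-instance version just does GANG_SIZE times as much loop bookkeeping and carries a mask that is never actually needed.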

BigFish

I agree with @mingf: we need to feed the same term-computation instructions to every program instance.

ankit1990

@kayvonf A quick clarification on the text "All instances run ISPC code in parallel". Shouldn't it be "All instances run ISPC code concurrently."?

kayvonf

@ankit, yes, you are correct! I was sloppy. See, you caught me conflating abstraction with implementation. We all do it!

However, it's a very restricted form of concurrency due to the gang convergence guarantee.

The slide is now fixed! Good catch.

Jing

Why is it concurrent? Don't all instances (on different SIMD lanes) just perform the operations in parallel? Sure, each will have to wait for the others before moving on, but at the very lowest level they run in parallel, right?

kayvonf

@jing: The ISPC language states that the instances in a gang run concurrently subject to a scheduling policy that preserves the gang convergence guarantee. There are schedules that preserve the guarantee without every instance executing in parallel. You also just conflated abstraction with implementation!

However, we know that the current implementation of ISPC does execute some instances of the gang in parallel.

Advanced Question: How does ISPC implement a gang of 16 instances on an 8-wide SIMD architecture? Do all instances execute in parallel? Do they execute concurrently? How does an implementation that does not execute all instances in parallel still respect the convergence guarantee? (If you can answer these questions you definitely understand how ISPC is implemented.)

Sherry

I'll give it a try. Not all the instances execute in parallel; only 8 of them do. Yes, they execute concurrently. I think ISPC will actually execute the function twice, with 8 instances each time.

sanchuah

The 8-wide SIMD implementation of ISPC will execute 8 instances at once, then do another 8 instances.
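
One schedule that would answer the advanced question can be sketched in plain Python (not ISPC). The assumption here is that each gang-wide statement is issued as two back-to-back 8-wide vector operations before the next statement starts; I have not checked this against the compiler's output. The names run_gang, double_lane and add_right_neighbor are hypothetical, and each statement is modeled as a function that reads the gang's state from before the statement and returns the new value for one lane.

```python
SIMD_WIDTH = 8   # hardware vector width from the question
GANG_SIZE = 16   # gang size from the question


def run_gang(statements, state, simd_width=SIMD_WIDTH):
    """Finish each statement for the whole gang before starting the next.

    Returns the final per-lane state and a trace of
    (statement_index, lanes_issued_together) tuples.
    """
    trace = []
    for idx, stmt in enumerate(statements):
        new_state = list(state)
        for start in range(0, len(state), simd_width):
            lanes = range(start, min(start + simd_width, len(state)))
            # The lanes in this chunk model one hardware vector instruction,
            # so they are the only ones that truly run in parallel.
            for lane in lanes:
                new_state[lane] = stmt(lane, state)
            trace.append((idx, tuple(lanes)))
        state = new_state
    return state, trace


def double_lane(lane, state):
    """Statement 0: purely per-lane work."""
    return lane * 2


def add_right_neighbor(lane, state):
    """Statement 1: a cross-lane read of the neighbor's previous result."""
    return state[lane] + state[(lane + 1) % len(state)]


if __name__ == "__main__":
    final, trace = run_gang([double_lane, add_right_neighbor], [0] * GANG_SIZE)
    for idx, lanes in trace:
        print("statement", idx, "lanes", lanes[0], "to", lanes[-1])
    print(final)
```

In this model only 8 instances run in parallel at any moment, yet all 16 finish a statement before any of them begins the next one. The cross-lane statement is where that ordering matters: lane 7 reads lane 8's value, so lane 8 must already have completed the previous statement. Passing simd_width=1 gives a fully serial schedule that produces the same result, which is an example of a schedule that respects the guarantee with no parallel execution at all.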