Slide 13 of 47
mitraraman

Since the GHC 5205 machines each have four cores, launching an ISPC program with tasks extends execution of the SIMD instructions from a single core to all four cores of the machine. As we saw in Assignment 1, running ISPC with tasks achieves multi-core execution and therefore gives a greater speedup than running ISPC code without tasks.

tliao

If you look at the source code for the ISPC task system implementation, you can see that it creates n-1 additional pthreads, where n is the number of cores. Each of these threads removes jobs from a shared task queue. It is interesting that the implementation of the task abstraction we use when writing ISPC programs, at least for the assignment, is itself built on another abstraction that we're more familiar with.
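
To make the "threads pulling jobs from a shared task queue" structure concrete, here is a minimal Python sketch of the pattern. It is not the ISPC task system source: `run_tasks`, `task_fn`, and `num_workers` are hypothetical names, and having the calling thread act as the n-th worker is an assumption made for illustration. Python threads also will not give real multi-core speedup for CPU-bound work, so this shows only the shape of the scheme.

```python
import queue
import threading


def run_tasks(task_fn, num_tasks, num_workers):
    """Run task_fn(i) for every i in range(num_tasks) on num_workers threads."""
    # Shared queue holding the indices of tasks that have not been claimed yet.
    tasks = queue.Queue()
    for i in range(num_tasks):
        tasks.put(i)

    results = [None] * num_tasks

    def worker():
        # Keep claiming task indices until the shared queue is empty.
        while True:
            try:
                i = tasks.get_nowait()
            except queue.Empty:
                return
            results[i] = task_fn(i)  # each index is written by exactly one thread

    # Assumption: spawn n-1 extra threads and let the caller be the n-th worker.
    extra = [threading.Thread(target=worker) for _ in range(num_workers - 1)]
    for t in extra:
        t.start()
    worker()
    for t in extra:
        t.join()
    return results


squares = run_tasks(lambda i: i * i, num_tasks=10, num_workers=4)
```

Because workers claim tasks one at a time, a thread that happens to draw cheap tasks simply goes back to the queue sooner, which is what makes this kind of dynamic assignment tolerant of uneven task costs.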

raphaelk

Going back to Problem 4, part 3, we found that the ISPC implementation is actually slower than the serial implementation when each gang's input is seven 1s and one 2.999999. At first this seems impossible: the serial code has to process seven 1s and one 2.999999 to finish "one gang", while ISPC should take only the 2.999999 time plus alpha (interleaved with one "1" task, I think...), because the 1s finish earlier in parallel. So the only reason for the slowdown must be ISPC overhead: the work it takes to allocate tasks to separate SIMD cores (4 ALUs), the work it takes to gather returned data from different cores, or some other overhead. Are we going to talk about these overheads at some point? And how big are they?

kayvonf

@raphaelk. There's one little detail you're overlooking (but you won't lose points for this since you got the gist of the question... which was to create an input that triggered maximally divergent execution). The reason for the slowdown is not ISPC "overhead" in assigning work, as you suggest. Remember that we set the gang size to eight, so the ISPC implementation is going to run all eight instances for the entire 2.999 input control sequence. However, since the gang size is 8 and the SIMD width of the GHC 5201/5205 Intel CPUs is only 4, you end up running two instructions to implement each gang operation on all eight instances. As a result, the ISPC code runs twice as many instructions as a sequential implementation of 2.999. Thus, the run time is much longer than that of a sequential implementation handling a single 2.999 input followed by seven 1.0 inputs. We were being a little tricky here. Your assessment of the situation (that overhead would be the only reason the ISPC implementation would be slower than sequential code) would certainly be correct if the gang size were set to four.
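
The arithmetic behind this can be sketched with a toy instruction-count model in Python. Everything here is illustrative: `sequential_cost` and `gang_cost` are hypothetical helpers, the iteration counts (1 for each 1.0 input, 50 for 2.999) are made-up placeholders rather than measured values, and the model assumes one vector instruction costs the same as one scalar instruction.

```python
import math


def sequential_cost(iters):
    """Scalar loop: each input pays only for its own iterations."""
    return sum(iters)


def gang_cost(iters, gang_size, simd_width):
    """Each gang runs until its slowest instance finishes. A gang wider than
    the hardware vector needs ceil(gang_size / simd_width) instructions
    for every gang-wide operation."""
    instrs_per_op = math.ceil(gang_size / simd_width)
    total = 0
    for start in range(0, len(iters), gang_size):
        gang = iters[start:start + gang_size]
        total += max(gang) * instrs_per_op
    return total


# Hypothetical per-input iteration counts: seven cheap 1.0 inputs, one slow 2.999.
iters = [1] * 7 + [50]

seq = sequential_cost(iters)                           # 7 * 1 + 50
wide = gang_cost(iters, gang_size=8, simd_width=4)     # 50 iterations * 2 instructions
narrow = gang_cost(iters, gang_size=4, simd_width=4)   # gangs cost 1 and 50
```

Under these assumed numbers the gang-of-eight version issues about twice the instructions of the slow input alone and exceeds the sequential total, while a gang of four does not, so in that case any slowdown would have to come from overhead.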