What is meant by "N instances of the program are always run together on the processor"? If not the compiler, who identifies that the for loop is executing the same instructions on different data, and hence that the iterations can run in parallel? Is it the hardware?
Please correct me if I'm wrong, but my understanding of what is stated on this slide for "Implicit SIMD" on the GPU is this: a single binary of the program of interest is created, and then N instances of that program are run simultaneously. Since the instances run in lockstep from the same binary (the same instruction is executed at each step), it is the hardware's responsibility to ensure that each program instance is applied to different data (for example, different iterations of a parallel for loop).
In lecture we talked about vectorized instructions replacing normal instructions to indicate that multiple instances of the program should run together. That replacement has to happen either in the code or via the compiler (see slide 29); the hardware just executes the vectorized instructions on the SIMD ALUs with different data.
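To make the "vectorized instructions replacing normal instructions" idea concrete, here is a minimal Python sketch (purely illustrative; `vec_mul` and `VEC_WIDTH` are hypothetical stand-ins for a single 4-wide SIMD multiply instruction a compiler might emit):

```python
# Hypothetical sketch of explicit SIMD: vec_mul stands in for ONE
# 4-wide SIMD multiply instruction that the compiler could emit in
# place of four scalar multiply instructions.

VEC_WIDTH = 4  # assumed SIMD width, for illustration only

def vec_mul(a, b):
    # One "vector instruction": multiplies up to VEC_WIDTH elements at once.
    return [x * y for x, y in zip(a, b)]

def scale_scalar(data, k):
    # Scalar code: one multiply instruction per element.
    return [x * k for x in data]

def scale_vectorized(data, k):
    # Compiler-style rewrite: one vec_mul per chunk of VEC_WIDTH elements.
    out = []
    for i in range(0, len(data), VEC_WIDTH):
        chunk = data[i:i + VEC_WIDTH]
        out.extend(vec_mul(chunk, [k] * len(chunk)))
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0]
assert scale_vectorized(data, 2.0) == scale_scalar(data, 2.0)
print(scale_vectorized(data, 2.0))  # -> [2.0, 4.0, 6.0, 8.0, 10.0]
```

The key point is that the vector form is produced ahead of time (by the programmer or the compiler), which is what makes this SIMD *explicit*.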
@pdp Just like the example: execute my_func N times. The instructions for this function are always the same, so the instances can run in parallel. Although there may be a conditional branch, we can discard the useless portion of the results. Please correct me if I'm wrong.
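The "discard the useless portion of the results" idea can be sketched in a few lines of Python (a hypothetical illustration of predicated/masked execution, not actual GPU behavior; `branchy_scalar` and `branchy_simd` are made-up names):

```python
# Hypothetical sketch: handling a conditional branch across SIMD lanes
# by executing BOTH sides of the branch for every lane, then keeping
# ("masking in") only the result each lane actually wanted.

def branchy_scalar(x):
    # The scalar program each instance logically runs.
    if x % 2 == 0:
        return x + 10
    else:
        return x - 10

def branchy_simd(xs):
    # What SIMD hardware effectively does when lanes diverge:
    mask = [x % 2 == 0 for x in xs]   # which lanes take the 'then' side
    then_side = [x + 10 for x in xs]  # all lanes execute the 'then' side...
    else_side = [x - 10 for x in xs]  # ...and the 'else' side
    # Discard the side each lane did not take.
    return [t if m else e for m, t, e in zip(mask, then_side, else_side)]

xs = [1, 2, 3, 4]
assert branchy_simd(xs) == [branchy_scalar(x) for x in xs]
print(branchy_simd(xs))  # -> [-9, 12, -7, 14]
```

Note the cost this implies: when lanes diverge, every lane pays for both sides of the branch, which is exactly why divergence hurts SIMD efficiency.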
I am also a bit confused by the notion of "implicit SIMD". My understanding is that the program instance might not necessarily contain any instructions that could be easily parallelized; instead, it is the execution of multiple instances that is parallelized.
@ycc I agree with this understanding. Each loop iteration may not have anything inside of it that is parallelizable, but as long as the iterations are independent of one another, each program instance can perform the task on a different set of data within the loop.
@yyc, bjay. Correct. The program is a completely sequential program that consists of scalar operations. However, many instances of the program are run by the hardware at the same time each with a different value for their "instance id". Since the hardware knows it is always running multiple instances of the same program, it can use SIMD processing techniques to execute those instances efficiently. Rather than the compiler generating SIMD instructions for a machine that executes a SIMD ISA, in this case, the ISA is in fact scalar, and the hardware at runtime is assigning instances to lanes of its SIMD ALUs.
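The runtime behavior described above can be simulated in a short Python sketch (a hypothetical illustration only; `my_func`, `run_instances`, and `SIMD_WIDTH` are made-up names, and real hardware does this lane assignment in silicon, not in a loop):

```python
# Hypothetical sketch of "implicit SIMD": my_func is a completely
# sequential, scalar program. The "hardware" (run_instances) launches
# N instances of it and assigns each instance to a SIMD lane at runtime,
# distinguished only by its instance id.

SIMD_WIDTH = 4  # lanes per SIMD ALU (assumption for illustration)

def my_func(instance_id, data):
    # Scalar program: each instance picks its data by instance id.
    return data[instance_id] * 2 + 1

def run_instances(n, data):
    results = [None] * n
    # The hardware groups instances into batches of SIMD_WIDTH.
    # Every lane in a batch executes the same instruction stream,
    # just with a different instance id.
    for base in range(0, n, SIMD_WIDTH):
        for lane in range(SIMD_WIDTH):
            instance_id = base + lane
            if instance_id < n:
                results[instance_id] = my_func(instance_id, data)
    return results

print(run_instances(6, [0, 1, 2, 3, 4, 5]))  # -> [1, 3, 5, 7, 9, 11]
```

Notice that nothing in `my_func` is vectorized; the SIMD width shows up only in how the "hardware" batches instances, which is what makes the SIMD *implicit*.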
It should be noted that, although NVIDIA's terminology calls the GPU's execution units "CUDA cores," each of these units does not have its own separate Fetch/Decode stage; despite the naming, they behave like lanes of a SIMD unit. Hence, divergence issues arise with these CUDA cores too.
How does the hardware know that it is running multiple instances of the same program?
@themj. Consider your experience with CUDA. How might the system get this information, given the syntax of CUDA?