Given that GPUs were originally for graphics, why did they design the hardware to be data-parallel in this way? Do graphics computations normally involve calling the same function several times on different input?
Also, is my_function here a single function in, say, a C program? Or is it a C program as a whole?
@0xc0ffe, yes, graphics computations have many unit functions that are applied to independent data. Say you have a triangle mesh with 5000 triangles, each triangle can be processed in parallel to generate pixels on the screen, though there are some data dependency on certain stages (operations on screen samples). If you are really interested in this, you should probably take a look at the OpenGL graphics pipeline.
Yes, graphics usually require a huge amount of repeated computations as most graphic computations are very large matrix operations (matrix multiplication for example).
I think my_function here can either be a function or a program. But the main point here is that GPU is able to run repeated instructions in parallel.
I am confused by the implicit SIMD. The way I understand explicit SIMD is that, the programmers use SIMD syntax to tell the compiler. And the compiler compiles vector binary code. What about implicit SIMD? What do programmers, compilers, and hardware do in this case? How does the hardware know that this program can be parallelized?
@pavelkang, Implicit SIMD means that the binary itself is scalar, but the hardware and/or runtime manages running that binary so that different ALUs compute the same instructions on different data. See OpenCL C for an example of what the programmer would write. I don't know about the specifics of GPU implementations, but I have worked on an OpenCL runtime for a system with both CPU and DSP cores on-chip. In that case, there was a software runtime on both the CPU and DSP side. A compiled scalar DSP binary would be copied into DSP memory dynamically, and the DSP runtime figured out how to divide the work among the cores. Each core would run the same instructions from the binary, but on different data.
Something interesting is that this scheme was not actually true SIMD in the sense of vector instructions, since each DSP core was executing an independent thread, but from the programmer's perspective it appeared as SIMD. You could write the same OpenCL C code you would use for programming GPUs, but run it on a DSP instead.
Have some thoughts and one question on explicit and implicit SIMD execution.
In explicit SIMD, programmer explicitly declares the program is a vector program, and it's programmer's responsibility to ensure independence and instruction stream coherence to achieve a nice performance. When a processor sees SIMD instruction, it'll instruct SIMD units to execute them. While in implicit SIMD, there's no such things like AVX instruction (vector add etc.). The processor works by running the same program in N copies, where N is larger than SIMD width.
Correct me if I'm wrong. Also, my question is, can I argue that explicit SIMD is less likely to encounter a "divergent" program problem if programmer ensure the coherence execution? Since in explicit SIMD it can get something like a guarantee from programmer, while for implicit SIMD there seems not such a thing. I wonder how implicit SIMD optimize on the instruction stream coherence issue.