lixf

The difference between implicit and explicit SIMD is still not clear to me. First of all, what's a scalar binary? Also, isn't executing a function N times similar to having an independent for loop that runs N times? Yet the first case is categorized as implicit while the latter is explicit?

RICEric22

I believe the difference is that in explicit SIMD, the compiler itself emits vector instructions (such as vstoreps and vmulps), single instructions that tell the machine to use multiple ALUs to operate on multiple data. In implicit SIMD, the compiler doesn't emit vectorized instructions; it emits the ordinary serial instructions (that output is the "scalar binary"), and the CPU hardware looks at the instructions and figures out that it can use multiple ALUs to process multiple data.

So in the example, the forall loop is explicit SIMD, since when the compiler reaches it, it knows it can vectorize the instructions that follow. When running a function N times, the compiler may not know whether it can be vectorized, but when the CPU sees the instructions and decides to use multiple ALUs, that is implicit SIMD.
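
To make the contrast concrete, here is a minimal C sketch of the two styles (my own, not from the slide). It assumes AVX and that n is a multiple of 8; the vmulps/vstoreps mentioned above correspond roughly to the _mm256_mul_ps/_mm256_storeu_ps intrinsics below:

```c
#include <immintrin.h>

// Explicit SIMD: the binary itself contains 8-wide vector
// instructions. The compiler (or, here, the programmer via
// intrinsics) has already decided how work maps to SIMD lanes.
void scale_explicit(const float *a, float *out, int n, float s) {
    __m256 vs = _mm256_set1_ps(s);                        // broadcast s to 8 lanes
    for (int i = 0; i < n; i += 8) {                      // assumes n % 8 == 0
        __m256 va = _mm256_loadu_ps(a + i);               // vector load
        _mm256_storeu_ps(out + i, _mm256_mul_ps(va, vs)); // vmulps + vstoreps
    }
}

// Implicit SIMD: the binary is purely scalar. The programmer says
// "run this body for every i from 0 to n-1", and the hardware maps
// many instances of this scalar instruction stream onto SIMD lanes.
void scale_one(const float *a, float *out, int i, float s) {
    out[i] = a[i] * s;    // one scalar multiply per instance
}
```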

tchitten

I'm not sure implicit SIMD is as magical as the CPU "figuring out" that it can run scalar instructions as SIMD by looking at the instruction stream. That would suggest that regular code that may be parallelizable, but contains zero instructions to make it parallel, would still be executed in parallel. My interpretation was that the CPU is explicitly told to run a certain function N times and the hardware supports running the iterations in parallel, i.e., the SIMD is inherent in the "run this function N times" bit.
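
For what it's worth, here is a toy C sketch of that "run this function N times" contract (the launch function and kernel_fn type are hypothetical names, just to illustrate the idea): the program declares the parallelism up front by handing over a scalar function and an instance count, and how instances get packed onto SIMD lanes is the hardware's business.

```c
// Hypothetical interface: "execute f for every index 0..n-1".
// The parallelism is declared by the call itself, so nothing in
// f's (scalar) instruction stream needs to be vectorized.
typedef void (*kernel_fn)(int i, void *args);

void launch(kernel_fn f, int n, void *args) {
    for (int i = 0; i < n; i++)   // serial stand-in: real SIMD hardware
        f(i, args);               // would run groups of instances in lockstep
}
```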

kayvonf

A wide SIMD width means it is more likely that you'll encounter divergence. (By the simple logic that if you have 32 things to do, it is more likely that one will have different control-flow requirements from the others than if you only have eight things to do.) When you have divergence, not all the SIMD execution units do useful work, because some of the units are assigned data that need not be operated on by the current instruction.

Another way to think about it is worst-case behavior. On an 8-wide system, a worst-case input might run at 1/8 of peak efficiency. While that's trouble, it's even more trouble to have a worst-case input on a 32-wide system... 1/32 of peak efficiency! Uh oh.
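
To see where the wasted work comes from, here's a sketch (mine, assuming 8-wide AVX and n divisible by 8) of how divergent control flow typically executes on SIMD hardware: both sides of the branch are computed for all lanes, and a mask selects the right result per lane, so lanes on the "wrong" side of the branch do no useful work.

```c
#include <immintrin.h>

// Divergence on 8-wide SIMD: when lanes disagree about a branch,
// the hardware effectively runs both paths and masks the results.
void negate_if_negative(const float *x, float *out, int n) {
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {                      // assumes n % 8 == 0
        __m256 v    = _mm256_loadu_ps(x + i);
        __m256 mask = _mm256_cmp_ps(v, zero, _CMP_LT_OQ); // lanes where v < 0
        __m256 neg  = _mm256_sub_ps(zero, v);             // "then" path, done on ALL lanes
        // Keep neg where the mask is set, v elsewhere. If only 1 of the
        // 8 lanes is negative, 7/8 of the "then" work above is wasted.
        _mm256_storeu_ps(out + i, _mm256_blendv_ps(v, neg, mask));
    }
}
```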

So there's a trade-off: wide SIMD allows hardware designers to pack more compute capability onto a chip, because they can amortize the cost of control across many execution units. This yields higher best-case peak performance, since the chip can do more math. However, it also increases the likelihood that the units cannot be used efficiently by applications (due to insufficient parallelism or due to divergence). Where the sweet spot lies depends on your workload.

3D graphics is a workload with high arithmetic intensity and also low divergence. That's why GPU designs pack cores full of ALUs with wide SIMD designs. Other workloads do not benefit from these design choices, and that's why you see CPU manufacturers making different choices. The trend is a push to the middle. For example, CPU designs recently moved from 4-wide SIMD (SSE) to 8-wide SIMD (AVX) to provide higher math throughput.