Previous | Next --- Slide 34 of 65
Back to Lecture Thumbnails
sjoyner

Is divergence a "big issue" because the program won't make good utilization of the floating point vector units?

kayvonf

@sjoyner: It can be a big issue if you consider efficiency! A very wide SIMD width means it is more likely that you'll encounter divergence. (By the simple logic that if you have 64 things do, it is more likely that one will have different control flow requirements from the others than if you only have eight things do). When you have divergence, not all the SIMD execution units do useful work, because some of the units are assigned data that need not be operated upon with the current instruction.

Another way to think about it is worst-case behavior. On an 8-wide system, a worst-case input might run at 1/8 peak efficiency. While that's trouble, it's even more trouble to have a worst case input on a 64-wide system... 1/64 of peak efficiency! Uh oh.

So there's a trade-off: Wide SIMD allows hardware designers to pack more compute capability onto a chip because they can amortize the cost of control across many execution units. This yields to higher best-case peak performance since the chip can do more math. However it also increases the likelihood that the units will be not be able to used efficiently by applications (due to insufficient parallelism or due to divergence). Where that sweet spot lies depends on your workload.

3D graphics is a workload with high arithmetic intensity and also low divergence. That's why GPU designs pack cores full of ALUs with wide SIMD designs. Other workloads do not benefit from these design choices, and that's why you see CPU manufactures making different choices. The trend is a push to the middle. For example, CPU designs recently moved from 4-wide SIMD (SSE) to 8-wide SIMD (AVX) to provide higher math throughput.

mburman

Kayvon's comment on lecture 1, slide 30 might help.

Divergence in code execute (e.g., if/switch statements) results in SIMD lanes getting masked. This inhibits performance since you aren't using the entire SIMD width of the target architecture which effectively slows things down.

Edit: I guess Kayvon beat me to it :)