Divergent execution makes the processor inefficient because its execution units (ALUs) cannot all share the same instruction.
I believe that the previous slide shows an example of divergent execution.
Divergent execution can occur if the code contains conditionals (as on the previous slide) or if some elements take longer to process than others. Because the lanes no longer all need the same instruction, some ALUs have to wait at certain points, which decreases efficiency.
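To put rough numbers on the conditional case, here is a minimal sketch that models a masked if/else across SIMD lanes. The function name and the cost model are my own assumptions, not something from the slides: each branch costs a fixed number of instructions, every lane sits through every branch that at least one lane takes, and a branch no lane takes is skipped.

```python
def if_else_utilization(conditions, then_cost=1, else_cost=1):
    """Fraction of issued lane-slots that do useful work for a masked if/else.

    Hypothetical model: `conditions` holds one boolean per SIMD lane, and
    each branch costs a fixed number of instructions (an assumption).
    """
    width = len(conditions)
    if width == 0:
        raise ValueError("need at least one lane")
    taken = sum(1 for c in conditions if c)
    not_taken = width - taken
    # Useful work: each lane only needs the branch it actually takes.
    useful = taken * then_cost + not_taken * else_cost
    # Issued work: all lanes are occupied for every branch that any lane takes.
    issued = width * ((then_cost if taken else 0) + (else_cost if not_taken else 0))
    return useful / issued


coherent = if_else_utilization([True] * 8)          # every lane takes the same branch
divergent = if_else_utilization([True, False] * 4)  # lanes split evenly
print(coherent, divergent)
```

Under this model the coherent case keeps every lane busy, while the evenly split case wastes half of the lane-slots because each lane is masked off during the branch it does not take.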
How is coherent execution implemented? Does the programmer indicate that some code has instruction stream coherence or is it inferred by the compiler or the hardware?
Is it correct to say that while coherent execution is not necessary for parallelization across cores, it is necessary to make use of the multiple ALUs within the same core?
Divergent execution on the same core leads to poor utilization of the ALUs because the outputs of some lanes are masked off. In other words, coherent execution, though not a necessary condition for parallelization, gives you much better utilization.
I was curious about cache coherence, so I did a little searching. Briefly, it looks like cache coherence means making sure that separate caches of a single resource stay consistent with each other. So it's similar to instruction stream coherence in that both are concerned with keeping things consistent, but instruction stream coherence is about the execution of the instruction sequence, whereas cache coherence is about the contents of memory.
SIMD execution divergence causes poor utilization when lanes have very different amounts of work, as in assignment 1, problem 2 with clamp multiplication.
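A rough way to see the effect of uneven work is to count lane-slots. The sketch below is only an assumed model (the function name and the grouping of consecutive elements into gangs are mine): each element needs some number of loop iterations, and a gang keeps running until its slowest lane finishes.

```python
def loop_utilization(iterations, width):
    """Fraction of issued lane-slots doing useful work when each element
    needs a different number of loop iterations.

    Hypothetical model: consecutive elements are grouped into gangs of
    `width` lanes, and a gang runs until its slowest lane is done.
    """
    useful = 0
    issued = 0
    for start in range(0, len(iterations), width):
        gang = iterations[start:start + width]
        useful += sum(gang)
        # Every lane in the gang is occupied for max(gang) iterations.
        issued += max(gang) * width
    return useful / issued if issued else 1.0


work = [1, 1, 1, 8]                 # one element needs far more iterations
print(loop_utilization(work, 4))    # one 4-wide gang waits on the slow lane
print(loop_utilization(work, 2))    # narrower gangs waste fewer lane-slots
```

With this input the 4-wide run does worse than the 2-wide run, which matches the observation that wider vectors make it more likely that some lane in the gang has much more work than the rest.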
And what can we as programmers do to increase instruction stream coherence? In assignment 1, we observed the effect of vector width on the probability of execution divergence. Are there any other ways to avoid divergence? Is there anything the compiler can do? Can the idle lanes be used for other work while waiting for the busy lanes to finish?
@rootB, no, the idle lanes cannot be used to execute different instructions.
One way I can think of to avoid divergence, given enough data, is to rearrange the data and then apply the SIMD operations to it one after another. For instance, assume that you want to apply two different filters to the odd and even rows of an image with 256 rows. Instead of a single stream with 50% utilization, you could split the data into odd and even rows and then execute both operations with 100% utilization.
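The odd/even idea can be checked with a small sketch. Everything here is an assumption made for illustration: lanes map to rows, the vector width is 8, each filter costs the same, and a gang that contains both odd and even rows has to run both filters with part of the lanes masked off each time.

```python
def row_filter_utilization(row_ids, width):
    """Fraction of issued lane-slots doing useful work when even rows get
    one filter and odd rows get another.

    Hypothetical model: one row per lane, equal cost per filter, and a gang
    runs one masked pass for each parity present among its rows.
    """
    useful = 0
    issued = 0
    for start in range(0, len(row_ids), width):
        gang = row_ids[start:start + width]
        parities = {r % 2 for r in gang}
        useful += len(gang)
        issued += len(parities) * width
    return useful / issued


rows = list(range(256))
interleaved = row_filter_utilization(rows, 8)                    # rows in image order
regrouped = row_filter_utilization(rows[0::2] + rows[1::2], 8)   # evens first, then odds
print(interleaved, regrouped)
```

In image order every gang mixes both parities, so each filter pass uses only half the lanes. After regrouping, every gang holds rows of a single parity and needs only one pass.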