In lecture we came up with some examples where this wouldn't be efficient: cases where only 1 of the 8 ALUs is doing work for a long time. Splitting into 8 cases or having a completely sequential program were some examples, but the simplest answer was a single if statement whose body runs for a long time on one ALU while the other 7 ALUs take the else branch, which finishes very quickly. For most of the program's running time, only one ALU is doing useful work, which is inefficient.
In the situation where the if case is really long and the else case is really short, my understanding is that all 8 ALUs will run the "if", which takes a long time, and then 7 of the 8 ALUs discard the result while only 1 keeps it (the only one that needs the "if" branch). Then all 8 ALUs run the "else", and the other 7 ALUs keep that result. Because those 7 ALUs spend a long time on the "if" branch and never use its result, it is inefficient. Is this correct? And how can you quantify how inefficient this is?
@srb A good way to quantify this is a vector utilization statistic. In assignment1, it is defined as the percentage of vector lanes enabled during a particular computation, assuming each SIMD instruction runs in one clock cycle. For a single SIMD instruction, we can have any number of vector lanes from 1 to N (N is the SIMD instruction width) enabled, with the remaining lanes sitting idle.
So, for an entire computation, you can add up all the vector lanes that were enabled and divide that by the total number of vector lanes across all instructions to get the vector utilization.
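As a rough illustration of that calculation, here is a minimal Python sketch. The function name `vector_utilization`, the representation of each instruction as a list of booleans (True means the lane is enabled), and the 4-wide example trace are all made up for this comment; they are not the Assignment 1 code.

```python
def vector_utilization(masks):
    """Return the fraction of vector lanes enabled across all instructions.

    masks: one list of booleans per SIMD instruction (hypothetical format);
    True marks an enabled lane, False marks an idle (masked-off) lane.
    Assumes every SIMD instruction takes one clock cycle.
    """
    enabled = sum(sum(1 for lane in mask if lane) for mask in masks)
    total = sum(len(mask) for mask in masks)
    if total == 0:
        raise ValueError("need at least one instruction with at least one lane")
    return enabled / total


# Hypothetical trace: three instructions on a 4-wide SIMD unit.
trace = [
    [True, True, True, True],     # all 4 lanes enabled
    [True, False, False, True],   # 2 lanes enabled
    [False, False, True, False],  # 1 lane enabled
]

# 7 enabled lanes out of 12 total lanes
print(f"{vector_utilization(trace):.1%}")
```

Counting per instruction rather than per lane keeps the sketch close to the definition above: every instruction contributes N lanes to the denominator whether or not they are enabled.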
For example, if we examine the 3 lines of the if body and the 2 lines of the else body above, we treat each box with an x as an idle vector lane for that clock tick. In total, 19 lanes are enabled out of 40 total vector lanes (8 lanes × 5 instructions), which yields a utilization of 19/40 = 47.5%.
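To check those numbers, here is a small Python sketch of the same count for a single if/else. It assumes each line of a branch body is one SIMD instruction, that both branches get executed, and that a lane is enabled only during the branch it actually takes. The function `branch_utilization` is a hypothetical helper, and the figure of 3 lanes taking the if branch is inferred from the totals (3 × 3 + 5 × 2 = 19), not read off the slide. The last call uses made-up lengths (1000 and 1) to put a number on the "long if, short else" case discussed earlier in the thread.

```python
def branch_utilization(width, lanes_taking_if, if_len, else_len):
    """Vector utilization for one divergent if/else (hypothetical helper).

    width: SIMD width (number of lanes)
    lanes_taking_if: how many lanes need the if branch
    if_len, else_len: instructions in each branch body, one cycle each
    Assumes both branches are executed by the whole vector with masking.
    """
    lanes_taking_else = width - lanes_taking_if
    enabled = lanes_taking_if * if_len + lanes_taking_else * else_len
    total = width * (if_len + else_len)
    return enabled / total


# The example above: 8 lanes, 3-instruction if body, 2-instruction else body.
# Assuming 3 lanes take the if branch: (3*3 + 5*2) / (8*5) = 19 / 40.
print(branch_utilization(8, 3, 3, 2))  # 0.475

# Made-up worst case: 1 lane takes a 1000-instruction if, 7 take a 1-instruction else.
print(f"{branch_utilization(8, 1, 1000, 1):.1%}")  # roughly 12.6%
```

With one lane in a very long if body, utilization approaches 1/8, which matches the intuition that only one of the eight ALUs is doing useful work for most of the run.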