Previous | Next --- Slide 31 of 66
Back to Lecture Thumbnails
lament

So, I didn't get exactly what was meant by this in lecture - I'm not quite clear on what the ALUs are suppose to be doing.

Do some of the ALUs not have the instructions loaded to perform the appropriate part of the conditional? So, of n many ALUs, x have the code loaded corresponding to if the conditional is true, and n-x have the code loaded corresponding to if the conditional is false?

Are the ALUs "precomputing" both parts of the conditional, and disregarding the one not matching the conditional? I seem to recall from lecture the answer to this was no, specifically considering: float x float y float z if(y != 0){ z= x/y } else{z = x} Couldn't the exception generated (by z = x/y when y is 0) just be disregarded?

I'm betting its neither of the above. Anybody?

BigFish

We feed 8 different elements in the array A to ALU1, ALU2, ..., ALU8. After computing the branch condition, for example, there are only three True value. Since in SIMD, we can only feed one instruction a time, when we are executing the instructions in the if clause, those in the else clause are blocked.

Faust

I believe that all of the ALUs must be performing the same instruction. Thus, in this example, I thought that at first, all the ALUs would perform the instructions for if case on their respective x values. At the end of performing the instructions for the if case we have a mask that blocks the results from the ALUs that should actually be running the else case. We then run the else case instructions on all the ALUs and reversing the mask such that it blocks the results from the ALUs that should be running the if case. I believe that the exceptions are just blocked by the mask but I may be wrong. Anybody else?

HCN

I'm still quite confused about how the mask works. If we translate the C code into instructions, we should see only one conditional jump at the beginning where the processor could know there is a branch and set the mask. However, there should be only unconditional jumps afterwards. How could the processor know when to reverse the mask? Are some flags stored in the context to achieve that?

afa4

My understanding of this slide is that: The condition (x > 0), the "if" part and the "else" part are all performed by each ALU. At the end, the part that matches the result of the condition is kept and the other part is discarded/masked. That is to say, SIMD processors eliminate branching completely and perform both the true and false parts. I think, this can be viewed as conversion of divergent execution to coherent execution.

Some information can be found in section 2.5 here

lament

@afa4 Ok. Hm. I guess there would not be a risk of side effects outside the ALU following this, would there be? What I mean is, since this is happening only in the ALU, we don't have to worry about different memory accesses occurring in the different branches of the conditional - or other things that cannot just be ignored, but must be undone. Really, since this is only occurring in the ALU, the only risks involve throwing up a flag /exception (like divide by zero), which can be ignored.

So, in short, this policy of ignoring the undesired result works because all operations performed were local to the ALU. Right?

toutou

Since now it is quite common for a GPU-accelerated processor to achieve a teraflop of computing performance, is it still necessary for us to do a lot of work to avoid branch divergence? After all, bandwidth is and will be the major bottleneck.

rojo

@toutou From my understanding GPUs can achieve teraflops of computing performance as they can use the multiple cores and SIMD units to perform computations. All of this would be achieved if some level of mechanisms like branch divergence can be minimized. Someone please correct if I am wrong.

Corian

So regardless of whether the results are masked or not, all ALUs still do the work? If so, then how are variables and flags reverted if they were changed during the useless work?

kayvonf

@corian: Let me be picky about terminology here: I believe you are referring to whether the contents of registers are changed (as opposed to variables --- that's a programming language abstraction). If the results of a lane are "masked" that means the execution unit's output is not written to the corresponding element in the destination vector register (and in the case of a load from memory its inputs aren't sourced either). Logically, it's the same if that unit was turned off for that clock and did nothing, so there's nothing to "revert".

kayvonf

@HCN. Excellent question. However, I'm curious if you're able to answer it now after working on Assignment 1 problem 2?

dumbo_

A worst case performance example: there are 8 elements in the input array, one of which follows the if branch and the rest following the else branch. And the 'if' branch has a million instructions and the 'else' branch has only 1. With masking, this will take each of 8 ALUs a million and one clocks, while peak performance would take a total of a million and seven clocks, which is approximately 1/8 the amount.