Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

eosofsky

What this diagram shows is that in the case of conditional code, the work on the branch that an element does not take is discarded (shown by the red X's). Every instruction on every branch is still executed for all columns in the diagram. A few methods for discarding the output were named in lecture including doing the work but not writing the results, and turning off an ALU so that nothing happens. In addition to not storing the results, discarding the output includes ignoring any exceptions that might have been triggered by the instruction had it not been discarded.

life

I am wondering what will happen if all elements in input array A are less than or equal to 0? Under this situation, the worst case peak performance should be 0 instead of 1/8. This clearly goes against what the slide says. Does anyone have an idea?

pdp

@life: In that case, it doesn't need to execute the T logic in any of the ALUs and hence, shouldn't the performance be at peak (1)?

yeq

@life: In the case you mentioned, the conditional code is redundant, because it doesn't need the if-branch at all. Any modern compiler should be able to optimize that out.

chenh1

We have to notice that not all ALUs do useful work because the core can only fetch and decode exactly one instruction at the same time. When there is a branch, the core must decide to execute one condition at first and the other later. The performance will be 50% if there are 2 branch. If all the data choose same branch, this is equivalent to no branches and performance will be 100%. The worst case occurs when there are 8 or more branches and all 8 data points which should be processed simultaneously hold different branches.

Vincent

I am wondering what if there are several conditions (e.g., switch), how can we implement those actual instructions because the bit mask cannot work at that case. How to determine which result from ALU is valid?

nnx

@yeq: I am a little confused as to how the compiler would know to optimize those instructions out. I think I understand how that would work if it was an if statement in which the clause is always true or always false (ex. abs(x) < 0). However, for clauses in which it depends on the input, how does that work? Ex. input: graph; If statement: number of neighbors of a vertex > number of vertices / 2.

Is there some way to notify the processor that all iterations are running the if instead of the else loop (in case the else loop is the time consuming loop)?

anon

@nnx I don't think so, because if you consider the next slide where we created the worst-case example, the program needs to execute/ wait for all branches to finish running.

ask

@Vincent I think one possible way to implement the instructions in the case of a switch statement would be to visualize the switch statement as an else-if ladder and have one bit mask for each of the cases you want to switch on.

aabhagwa

@nnx: I think you're right, the compiler won't always be able to statically optimize away conditional branches. If all the conditional branches happen to evaluate to 'true', I think the runtime/processor will check for this and skip the else-branch (I remember Prof Kayvon saying in class that SIMD processors usually have an instruction to check if a vector is all-ones or all-zeroes).

nishadg

Translating these independent loop iterations into vector instructions, as done here as well as in our first assignment, reminds me of using Matlab and optimizing to take advantage of large scale vector/matrix operations to manipulate data.