Slide 30 of 65
cardiff

An example from lecture exhibiting nearly worst-case performance is as follows: imagine there are 8 elements in the input array, one of which follows the 'if' branch and the rest follow the 'else' branch, where the 'if' branch has a million instructions and the 'else' branch has only 1. With masking, each of the 8 ALUs is occupied for a million and one clocks, but the useful work is only a million and seven instructions in total -- so utilization is approximately 1/8 of peak.

yanzhan2

I agree with the explanation. But I think peak performance should be measured in terms of the useful work done by the ALUs, not the number of cycles.

paraU

In the worst case, the performance is almost the same as a core with only one ALU. It seems adding ALUs is really a good method to speed up the CPU. Any tradeoffs? (Of course, it's slightly more expensive.)

bxb

@paraU This seems to tie a lot into ISA design. Will the ALU dispatch be a decision for the programmer (i.e., SIMD instructions) or handled solely by hardware (i.e., superscalar architectures)? As more of the implementation is hidden from the programmer, the hardware logic to manage the data going into and coming out of the ALUs becomes more complicated. And if there are enough dependencies in the code, the CPU waits even longer to dispatch data. Mix this in with out-of-order execution/reordering logic, multi-threading, branch prediction, etc., and now there is a host of indirect decisions involved in adding more ALUs. I guess what it really comes down to is how well the CPU can make use of the ALUs.

clairechan

In regards to cardiff's example, what if all elements followed the else case?

elemental03

If all elements followed the else case in @cardiff's example, i.e., there are no inputs that would trigger the 'if' clause, then it should take each of the ALUs 1 clock, right? Would the peak performance be 1 clock or 8 clocks in that case?

kayvonf

The code on the right is a single sequential sequence of control. The variable x is meant to be a single floating-point value (e.g., 'float x'). Now assume that this code might be the body of a loop with independent iterations, such as the forall loop I showed on slide 17. So let's say we're running this code for each element in an array 'A', and for each iteration of the loop x is initialized to A[i].

Now, we know we have SIMD execution capabilities on a processor and we'd like to make use of them to run faster. However, we have a program that's written against an abstraction where each iteration of the loop has its own control flow. No notion of SIMD execution is present at the programming language level.

The problem at hand is how to map a program written against this abstraction onto the actual execution resources, which have constraints not exposed at the language level. (For example, the mapping would become trivial if 'if' statements were not allowed in the language, forcing all iterations to have the same instruction sequence.)

On slide 33 I point out that the act of mapping such a program into an efficient SIMD implementation can be the responsibility of a compiler or it could be the responsibility of the hardware itself (as in slide 34). One example of the former case is the ISPC compiler. It emits SSE or AVX instructions into its compiled programs that generate lane masks and use those masks to prevent certain writes, and it also generates code that handles the branching logic in the manner we discussed in class. The processor (e.g., the CPU in the case of ISPC) just executes the instructions it's told. There's no smarts needed. All the smarts are in the compiler.

On the other side of the design spectrum, the compiler can be largely oblivious to SIMD capabilities of hardware, and most of the smarts can go into the hardware itself. To date, this has been the GPU-style way of doing things. Although we'll talk about GPUs in detail later in the course, here's a good way to think about it for now: The GPU will accept a mini-program written to process a single data item (e.g., you can think of the mini-program as a loop body) as well as a number of times the mini-program should be run. It's like the interface to the hardware itself is a foreach loop.

rokhinip

Also, with reference to @elemental03's question, I don't think we should gauge performance based on clock cycles (I don't quite know what you mean by an ALU clock). If all lanes of the 8-wide vector go to the else case, then no ALU is idle, since all of them will be executing the code block for the else clause. So we still achieve peak performance (all 8 of 8 ALUs are doing useful work).