kapalani

One of the reasons CPU clock rates could be increased was pipelining: a single instruction is split into a sequence of many stages, so that only a small amount of work has to be performed each clock cycle. Each instruction effectively takes more clock cycles, but the clock runs faster. In addition to power limitations, pipelines can't be made arbitrarily deep, because an instruction can only be broken into so many stages before each stage is no longer able to perform useful work and still meet the timing requirements of the next stage (a rough numeric sketch of this follows below).
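
A minimal back-of-the-envelope sketch of this diminishing-returns effect, assuming made-up numbers for the total logic delay and the per-stage latch overhead (neither comes from the lecture): splitting a datapath with delay D into S stages gives a clock period of roughly D/S plus a fixed per-stage overhead, so past some depth the overhead dominates and deepening the pipeline stops paying off.

```c
#include <stdio.h>

int main(void) {
    double datapath_delay_ns = 5.0;  /* assumed total logic delay        */
    double latch_overhead_ns = 0.1;  /* assumed per-stage latch overhead */
    int n = 1000000;                 /* instructions to run              */

    for (int stages = 1; stages <= 32; stages *= 2) {
        double period = datapath_delay_ns / stages + latch_overhead_ns;
        /* Pipeline fills in 'stages' cycles, then retires one
           instruction per cycle. */
        double total_ns = (stages + n - 1) * period;
        printf("%2d stages: clock %.3f ns, total %.2f ms\n",
               stages, period, total_ns / 1e6);
    }
    return 0;
}
```

Going from 1 to 2 stages nearly halves the total time, but by 32 stages the 0.1 ns overhead is a large fraction of the 0.256 ns period, so each further doubling helps less and less, which is exactly the "each stage can no longer do useful work" limit described above.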

apr

To add to @kapalani's point, it's interesting to note how pipelining differs across types of architectures. One way to split an instruction is along the lines of Fetch, Decode, and Execute, giving 3 steps. When these steps are pipelined, while instruction #i is in the execute step, #i+1 can be in the decode step and #i+2 in the fetch step (see the timeline sketched below). RISC architectures (simpler instructions, e.g. ARM) are more conducive to pipelining than CISC (complex instructions, e.g. x86). For example, x86, which is CISC, has variable-length instructions, so the fetch stage doesn't know where the next instruction begins until the current one has been at least partially decoded, which makes it harder to keep the pipeline fed and can cause stalls.

Useful link: https://cs.stanford.edu/people/eroberts/courses/soco/projects/2000-01/risc/pipelining/index.html
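
As a purely illustrative sketch of the timeline described above (the 3-stage split and the instruction count are just the assumptions from this comment, not anything specific to a real core), this prints which stage each instruction occupies in each cycle:

```c
#include <stdio.h>

/* Print the schedule of a toy 3-stage Fetch/Decode/Execute pipeline. */
int main(void) {
    const char *stage[] = { "F", "D", "E" };
    int n_instr = 4, n_stages = 3;

    for (int i = 0; i < n_instr; i++) {
        printf("instr %d: ", i);
        for (int cycle = 0; cycle < n_instr + n_stages - 1; cycle++) {
            int s = cycle - i;  /* stage this instruction is in */
            printf("%2s", (s >= 0 && s < n_stages) ? stage[s] : ".");
        }
        printf("\n");
    }
    return 0;
}
```

After the initial fill, one instruction completes every cycle even though each individual instruction still takes 3 cycles of latency.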

One more point I would like to add is that while ILP executes strictly independent instructions in parallel, pipelining overlaps steps of possibly dependent instructions. To prevent the stalls this can cause, e.g. when the pipeline can't know which instruction to fetch next until a branch resolves, branch predictors are used so that every part of the pipeline is doing useful work with some probability. For the more software-inclined, something interesting I have come across is the likely() and unlikely() macros implemented and used in the Linux kernel, which let the programmer hint the compiler so it emits code better laid out for branch prediction (a small example follows the link below).

Useful link: https://kernelnewbies.org/FAQ/LikelyUnlikely
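
A minimal sketch of how those hints are used. The macro definitions below mirror the kernel's (they wrap GCC's __builtin_expect); the process() function and its error condition are made up for illustration:

```c
#include <stdio.h>

/* Same definitions the Linux kernel uses (GCC/Clang builtins). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical hot loop: the error path is declared unlikely, so the
   compiler keeps the common case on the straight-line fall-through
   path, which is friendlier to the branch predictor and the i-cache. */
int process(const int *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(buf[i] < 0))
            return -1;          /* rare error path, moved out of line */
        sum += buf[i];
    }
    return sum;
}

int main(void) {
    int data[] = {1, 2, 3, 4};
    printf("%d\n", process(data, 4));
    return 0;
}
```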

sampathchanda

One other (probably not as significant) source of processor performance arises from the paradigm of out-of-order execution. This is similar to the latency hiding we do in later lectures using hyper-threading: just as multiple threads let the processor switch contexts whenever one thread is waiting, an out-of-order processor executes instructions in the order their data becomes available rather than strictly in program order. This optimization reduces the time spent waiting on slow instructions, which would otherwise block independent instructions behind them from executing (the sketch below shows the kind of independent work that can be pulled forward).
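
A tiny C illustration of that independence (the array and constants are arbitrary, chosen only to create one dependent and one independent statement): while the load of a[i] is still outstanding, an out-of-order core can already execute the second update instead of stalling behind the first.

```c
#include <stdio.h>

int main(void) {
    int a[] = {1, 2, 3, 4};
    long x = 0, y = 0;

    for (int i = 0; i < 4; i++) {
        x += a[i] * 7;  /* depends on the load of a[i]             */
        y += i * 3;     /* no dependence on the load: can issue    */
                        /* as soon as its operands are ready, even */
                        /* if the load above is still in flight    */
    }
    printf("x=%ld y=%ld\n", x, y);
    return 0;
}
```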