
For question 2, there were a bunch of factors that prevented us from obtaining maximum speedup. As Kayvon pointed out in lecture, one such factor was improper load balancing, where one thread might be given most of the work while the others are given just a small fraction. Another factor was communication overhead: once the threads had finished, they took some time to communicate their results to each other. Another was sequential access to a shared resource. One other factor was that some work is inherently sequential, so it's not possible to do it in parallel.
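The "inherently sequential" factor is exactly what Amdahl's law quantifies. A minimal sketch (the 10% sequential fraction is an assumed example value, not from the lecture):

```python
# Amdahl's law: if a fraction s of the work is inherently sequential,
# the best possible speedup on p processors is 1 / (s + (1 - s) / p).
def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

# With an assumed 10% sequential fraction, speedup is capped below 10
# no matter how many processors we add.
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))
```

Even at p = 1024 the speedup stays under 10, which is why the sequential portion dominates long before communication costs do.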


I believe that for the first question, the slowdown in improvement has come from "the power wall" as described in the last lecture. This wall comes from the physical limits on the power consumed (and dissipated) by the transistors in a chip. Refer to this slide and the next few.
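The power wall is usually summarized by the dynamic power relation for CMOS logic. A rough sketch (the specific numbers below are made up for illustration):

```python
# Dynamic power of CMOS logic: P ~ C * V^2 * f, where C is switched
# capacitance, V is supply voltage, and f is clock frequency.
# Historically, raising f also required raising V, so power grew
# much faster than linearly in frequency.
def dynamic_power(c, v, f):
    return c * v * v * f

# Illustrative (made-up) numbers: doubling frequency while raising
# voltage 20% roughly triples power, which the chip cannot dissipate.
base = dynamic_power(1.0, 1.0, 1.0)
faster = dynamic_power(1.0, 1.2, 2.0)  # 2.88x the power
```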


Shared Resource
Inherently sequential work
Load balancing

  1. Because of the power wall, increasing clock frequency is becoming difficult.

  2. Communication, Work Distribution.


For question 2, could we also give this as a reason? If we try to make the task "too parallel", e.g. divide a sum into many parallel computations, the overhead of adding up the partial results surpasses the improvement obtained from parallelism. Can this stand as an independent point, or does it come under one of the ones described above?


For self-check question 2: there could be systems with a superscalar architecture that do not restrict themselves to executing a single instruction stream. For instance, with fine-grained multithreading in a pipelined superscalar processor, multiple instruction streams are executed at the cost of increased latency for each individual stream. However, overall throughput benefits from this approach, since stalls in one stream can be hidden by issuing instructions from another.
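That latency-for-throughput trade can be seen in a toy model (the RUN/STALL cycle counts are assumed for illustration, not from any real machine):

```python
# Toy model: each thread does RUN cycles of useful work, then stalls
# for STALL cycles (e.g. waiting on memory). A single thread leaves
# the pipeline idle during stalls; interleaving enough threads lets
# the pipeline overlap one thread's stall with other threads' work.
RUN, STALL = 4, 12

def throughput(threads):
    # Useful cycles the pipeline can fill per RUN+STALL period.
    busy = min(threads * RUN, RUN + STALL)
    return busy / (RUN + STALL)

# One thread keeps the pipeline 25% busy; four threads fully hide the
# stalls (throughput 1.0), but each thread now waits for its turn.
```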


For self-check question 1: an instruction stream is a set of instructions that can only be fetched one at a time. CMIIW


For self-check question 2, I thought that when we talked about superscalar execution, we were specifically talking about optimizing the performance of a single instruction stream. The idea of being superscalar was giving the processor one stream that it would look at and decide what order to process instructions in (possibly executing some instructions simultaneously, which could improve performance), so that when we observe the program from a higher level, everything seems to happen in the order we wrote it.


@mrrobot what exactly do you mean by "too parallel"? Maximum speedup would mean that if there are p processors, you divide the task into p parts to achieve a speedup of p. If you try to divide the task into more parts than there are processing elements, it will likely lead to load imbalance. Adding up the results has to be sequential at some point, which prevents the system from achieving the maximum speedup. So it's covered by the drawbacks mentioned above.


@violet, I mean the higher the value of p, the higher the number of partial results produced. At some point the overhead of adding the partial results may offset the benefit of raising p. I think the "sequential work" mentioned above referred to parts of the program that cannot be parallelized, not to adding up the partial results produced by parallelization.
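One way to see this point is a toy cost model for the parallel sum (the per-partial-result combine cost c is an assumed constant, chosen only for illustration):

```python
# Toy cost model: p workers each sum n/p elements in parallel, then
# the p partial sums are combined sequentially at cost c per result.
# Total time first falls with p, then rises once combining dominates.
def total_time(n, p, c=50):
    return n / p + c * p

n = 1_000_000
moderate = total_time(n, 16)    # parallel part dominates
extreme = total_time(n, 4096)   # combine overhead dominates
```

Under this model, raising p past the sweet spot makes the program slower, which supports treating "too parallel" as a form of communication/combining overhead rather than an independent category.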


Superscalar execution does refer to optimizing the performance of executing a single instruction stream: the processor dynamically finds independent instructions in the stream and executes them in parallel.


For 1, I think this article provides a great summary of why improvements to clock speed and power have plateaued in recent years, especially for people like me who are less familiar with the ECE side of processors.


I think among the major reasons were communication and work distribution.


For self-check question 2, the answer should be yes.

The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

  1. Instructions are issued from a sequential instruction stream.

  2. The CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).

  3. The CPU processes multiple instructions per clock cycle.
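The dynamic dependence check can be sketched in miniature (a simplified model: it only tracks read-after-write dependences and assumes an issue width of 2; real hardware also handles write hazards, renaming, etc.):

```python
# Toy superscalar issue model: walk a sequential instruction stream and
# group consecutive instructions that could issue in the same cycle.
# Each instruction is (dest_register, source_registers).
def issue_groups(instrs, width=2):
    groups, current, written = [], [], set()
    for dest, srcs in instrs:
        # Read-after-write dependence on something in the current group?
        depends = any(s in written for s in srcs)
        if depends or len(current) == width:
            groups.append(current)          # close the current cycle
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

prog = [("a", ("x",)), ("b", ("y",)), ("c", ("a", "b"))]
# "a" and "b" are independent -> same cycle; "c" reads both -> next cycle.
```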


[q1] As per lecture 1, slide 33/43: (1) The rate at which processor clock rates increase leveled off around 2005. (2) It's possible that we cannot keep benefiting from increased transistor density due to the breakdown of Dennard scaling, also around 2005.

[q2] I agree with PandaX. Communication and work distribution costs slowed the process down.