Instructor Note

Instruction level parallelism (ILP) is a property of a program: it is determined by the dependencies between the program's instructions, and it measures how many instructions can potentially be executed in parallel. A particular processor with superscalar execution capabilities has only so many execution resources, and when there is more independent work than there are execution units to do it, the processor's capabilities are the limiting factor.

However, as the next slide shows, for many common programs, it's hard to find many independent instructions in a single instruction stream. In these situations, no matter how many execution units a core has, performance will be limited by the fact that the program does not have sufficient parallelism.
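
To make the dependence point concrete, here is a tiny sketch (the (dest, sources) instruction representation is invented purely for illustration and is not code from the lecture). It estimates the ILP available in a straight-line instruction stream as the instruction count divided by the length of the longest dependence chain:

    # Toy ILP estimate: instructions are hypothetical (dest, [sources]) tuples.
    def ilp_estimate(instructions):
        ready = {}        # register -> earliest "cycle" its value is available
        longest = 0
        for dest, srcs in instructions:
            # an instruction can begin once all of its source values are ready
            start = max((ready.get(s, 0) for s in srcs), default=0)
            ready[dest] = start + 1
            longest = max(longest, ready[dest])
        return len(instructions) / longest

    # x = a*b and y = c*d are independent, so a 2-wide superscalar core could
    # issue them together; z = x + y must wait for both, no matter how wide
    # the core is.
    program = [("x", ["a", "b"]), ("y", ["c", "d"]), ("z", ["x", "y"])]
    print(ilp_estimate(program))   # 3 instructions / 2-long chain = 1.5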

haboric

I searched online and found that there are two types of parallelism: dynamic parallelism and static parallelism. For dynamic parallelism, the processor decides which instructions to execute in parallel at runtime; for static parallelism, the compiler decides which instructions to execute in parallel at compile time. (src: https://saimulticorecomputing.wordpress.com/2013/10/24/dynamic-and-static-parallelism/)

Question: Does a compiler exploiting static parallelism count as superscalar execution?

karima

There was an interesting brief conversation in class about whether it should be the hardware's or the compiler's job to find parallelism in a program.

The cost of having the hardware find instruction level parallelism is the extra hardware required, plus the time spent finding ILP at runtime.

Very long instruction word (VLIW) is a processor architecture that allows programs to tell the hardware which instructions should be executed in parallel. This makes finding instruction level parallelism the compiler's job rather than the hardware's. Proponents of VLIW argue that this allows for a simpler, more elegant chip design and, when given code that fully utilizes the hardware, better performance, since no time is spent at runtime finding ILP.

VLIW architectures are probably best for embedded systems used for media processing tasks such as image processing. Such tasks have a lot of inherent parallelism that makes it easy for the compiler to generate code that will fully utilize the processor's capabilities.
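
As a very rough sketch of the idea (not any real VLIW ISA or compiler pass), here is how a compiler might greedily pack independent operations into fixed-width bundles, reusing the made-up (dest, sources) tuples from the sketch under the instructor note above:

    # Greedy bundle packing sketch; assumes straight-line code where each
    # destination register is written at most once (SSA-like).
    def pack_bundles(instructions, width=2):
        produced_by = {dest: i for i, (dest, _) in enumerate(instructions)}
        done, bundles = set(), []
        remaining = list(range(len(instructions)))
        while remaining:
            bundle = []
            for i in remaining:
                _, srcs = instructions[i]
                deps = [produced_by[s] for s in srcs if s in produced_by]
                # an op joins the current bundle only if a slot is free and
                # all of its producers completed in an earlier bundle
                if len(bundle) < width and all(d in done for d in deps):
                    bundle.append(i)
            bundles.append(bundle)
            done.update(bundle)
            remaining = [i for i in remaining if i not in done]
        return bundles

    program = [("x", ["a", "b"]), ("y", ["c", "d"]), ("z", ["x", "y"])]
    print(pack_bundles(program))   # [[0, 1], [2]]: x and y share a bundle, z waits

The point is that all of this reasoning happens at compile time; the hardware just executes whatever each bundle tells it to.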

ppwwyyxx

In the case of instruction-level optimization, I think it's entirely possible to know which instructions can be executed in parallel at compile time (i.e., a static optimization), instead of at run time. In fact, I think a compiler should be able to do a better job than the CPU, since it knows about high-level program semantics. But Intel doesn't want to break compatibility by introducing a new architecture for this feature. I guess that's why finding ILP is now built into the CPU.

This is different from other cases of optimization, where run-time behavior provides more information, so dynamic optimization can outperform static optimization. For example, languages that run in virtual machines (such as Java) can make use of run-time behavior to optimize branch prediction or memory allocation.

nmrrs

In case anyone's interested, scoreboarding is one of the classic algorithms for dynamically exploiting instruction-level parallelism. In case you don't want to read through all of that, TL;DR: scoreboarding keeps track of the data dependencies of the instructions it has seen. Instructions are only allowed to execute when they don't have any conflicts with instructions already being executed, and will stall until all data dependencies are resolved.
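
Here's a very stripped-down sketch of that bookkeeping (my own simplification; a real scoreboard also tracks functional units, handles write-after-read hazards at write-back, and so on):

    class Scoreboard:
        """Tracks destination registers with in-flight (unfinished) writes."""
        def __init__(self):
            self.in_flight = set()

        def issue(self, dest, srcs):
            raw = any(s in self.in_flight for s in srcs)   # read-after-write hazard
            waw = dest in self.in_flight                   # write-after-write hazard
            if raw or waw:
                return False        # stall until the conflicting writes complete
            self.in_flight.add(dest)
            return True

        def complete(self, dest):
            self.in_flight.discard(dest)   # result written back; dependents may issue

    sb = Scoreboard()
    assert sb.issue("x", ["a", "b"])        # issues immediately
    assert sb.issue("y", ["c", "d"])        # independent of x, issues as well
    assert not sb.issue("z", ["x", "y"])    # stalls: x and y are still in flight
    sb.complete("x"); sb.complete("y")
    assert sb.issue("z", ["x", "y"])        # dependencies resolved, now issues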

zl9394

The Tomasulo algorithm is an improved version of the scoreboarding algorithm that adds a register renaming mechanism.

pandu

I was interested in knowing when superscalar processors became mainstream, and I found some information here. Although the first computer with a superscalar design was the CDC 6600 in 1964, which was the fastest supercomputer of its time, the first commercial superscalar microprocessors appeared between 1988 and 1990. This matches the graph shown on slide 33. The purple line on that graph shows the amount of instruction-level parallelism, and it starts increasing around 1990, which is when these new superscalar processors came out.

PIC

@ppwwyyxx A compiler can in fact exploit a good amount of ILP. The debate about how much of the work should be done by the compiler and how much should be done by the processor has a long history. Here you can find a number of algorithms that exploit ILP through "instruction scheduling":

http://www.cs.cmu.edu/afs/cs/academic/class/15745-s15/public/lectures/L17-Intro-to-Scheduling.pdf

An interesting aspect of the compiler vs. processor debate is who should be in charge of managing registers. Intel's architectures expose only 16 registers to the user even though the register file actually contains hundreds of registers; this allows the processor to do a lot of the register renaming. On RISC architectures (ARM, sort of), the registers are mostly all exposed to the user, which offloads the burden to the user.

0xc0ffee

Woah, I had no idea Intel architectures had more registers than they expose to the user.

https://en.wikipedia.org/wiki/Register_renaming has a cool example about instruction level parallelism.

Key quote:

"When possible, the compiler would detect the distinct instructions and try to assign them to a different register."

That's crazy. I had no idea this happened.
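
Here's a toy sketch of the hardware-side version of that idea (my own, not how any real renamer is built), using the same made-up (dest, sources) tuples as the earlier sketches: every write to an architectural register gets a fresh physical register, so only true read-after-write dependences survive.

    from itertools import count

    def rename(instructions):
        fresh = count()   # pretend we have an unbounded physical register file
        mapping = {}      # architectural register -> physical register holding its value
        renamed = []
        for dest, srcs in instructions:
            # reads refer to whichever physical register holds the current value
            phys_srcs = [mapping.get(s, s) for s in srcs]
            # each write gets a brand-new physical register, killing WAR/WAW hazards
            mapping[dest] = f"p{next(fresh)}"
            renamed.append((mapping[dest], phys_srcs))
        return renamed

    # r1 is reused for two unrelated values; after renaming they live in p0 and
    # p2, so the two short chains can execute in parallel.
    prog = [("r1", ["a", "b"]), ("r2", ["r1", "c"]),
            ("r1", ["d", "e"]), ("r3", ["r1", "f"])]
    print(rename(prog))
    # [('p0', ['a', 'b']), ('p1', ['p0', 'c']), ('p2', ['d', 'e']), ('p3', ['p2', 'f'])]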

chuangxuean

Suppose we have one compiler that figures out certain opportunities for parallelism in the source code before producing the target assembly, and one that does not. How does the assembly produced by the first compiler give rise to more opportunities for ILP than the assembly produced by the second?

zvonryan

When we talk about instruction scheduling, I think it is also interesting to think about the allocation of registers. We know that it is an important job for a compiler to optimize through instruction scheduling as well as register allocation in order to exploit the capabilities of the processor. The relationship between these two is worth putting some thought into. If we allocate registers first, some data hazards might be introduced, placing constraints on the scheduling that comes later. If we do scheduling first, the compiler might not reuse registers well, for example using two registers simultaneously when one would suffice. The number of registers is limited and should be managed carefully.

I found a paper that addresses this problem (the phase ordering of instruction scheduling and register allocation) in a fairly intuitive way, by combining the two phases.
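
A toy illustration of that tension (my own, reusing the made-up (dest, sources) tuples from earlier comments): if the register allocator runs first and reuses r1 for two unrelated values, it introduces false write-after-read/write-after-write dependences that constrain the scheduler, while spending one extra register removes them.

    def depth_with_false_deps(instructions):
        """Critical path length when register names are real storage locations,
        so WAR and WAW hazards serialize execution along with true RAW hazards."""
        write_done = {}   # register -> cycle its most recent write completes
        last_read = {}    # register -> latest cycle at which it is read
        depth = 0
        for dest, srcs in instructions:
            start = max([write_done.get(s, 0) for s in srcs]   # RAW: wait for inputs
                        + [write_done.get(dest, 0),             # WAW: prior write to dest
                           last_read.get(dest, 0)])             # WAR: prior readers of dest
            write_done[dest] = start + 1
            for s in srcs:
                last_read[s] = max(last_read.get(s, 0), start)
            depth = max(depth, start + 1)
        return depth

    reuse_r1 = [("r1", ["a", "b"]), ("r2", ["r1", "c"]),
                ("r1", ["d", "e"]), ("r3", ["r1", "f"])]   # allocator reused r1
    two_regs = [("r1", ["a", "b"]), ("r2", ["r1", "c"]),
                ("r4", ["d", "e"]), ("r3", ["r4", "f"])]   # one extra register

    print(depth_with_false_deps(reuse_r1))   # 3 cycles: r1 reuse serializes the chains
    print(depth_with_false_deps(two_regs))   # 2 cycles: the chains run side by side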