Slide View : Parallel Computer Architecture and Programming : Tsinghua Summer 2017

Slide 41 of 86

wuhaozhe

Can I say that two instructions may be executed at once in one thread? If so, would there need to be more than one ALU to execute the two instructions? (That's completely different from what I knew before.)

kayvonf

In my slides, execution units (also referred to as ALUs) are capable of performing a single instruction per clock. Sometimes that instruction is a regular operation on single values, or it could be a SIMD operation on vector values.

If a processor supports superscalar execution then it absolutely can perform multiple instructions from the same thread in a single clock. (That is the definition of superscalar!) For example, a 2-way superscalar core needs the ability to decode two instructions per clock (two fetch/decode units) and two execution units to execute those two instructions.
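To make the 2-way case concrete, here is a minimal sketch of in-order issue, not a model of any real processor. It assumes every instruction takes one clock and that an instruction cannot issue in the same clock as the instruction that produces one of its inputs. The function name `cycles_to_issue`, the `(dest, sources)` tuple format, and the register names are all made up for illustration.

```python
def cycles_to_issue(instrs, width):
    """Clocks needed by a toy in-order core that issues up to `width`
    instructions per clock. Only true (read-after-write) dependencies
    are modeled; every instruction is assumed to take one clock."""
    clock = 0
    i = 0
    while i < len(instrs):
        written = set()  # registers produced by instructions issued this clock
        issued = 0
        while i < len(instrs) and issued < width:
            dest, srcs = instrs[i]
            if any(s in written for s in srcs):
                break  # needs a result from this same clock, so it waits
            written.add(dest)
            issued += 1
            i += 1
        clock += 1
    return clock


# Four independent adds (hypothetical registers).
independent = [("r0", ("a", "b")), ("r1", ("c", "d")),
               ("r2", ("e", "f")), ("r3", ("g", "h"))]

# Four adds where each one consumes the previous result.
chain = [("r0", ("a", "b")), ("r1", ("r0", "c")),
         ("r2", ("r1", "d")), ("r3", ("r2", "e"))]

print(cycles_to_issue(independent, 1))  # scalar core
print(cycles_to_issue(independent, 2))  # 2-way superscalar core
print(cycles_to_issue(chain, 2))        # 2-way core, but no independent work
```

In this sketch the second execution unit only helps on the independent stream; on the dependent chain the 2-way core is no faster than a core that issues one instruction per clock.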

You see an example of such a processor here, here, and here!

Kaharjan

In my understanding, a superscalar processor has more than one fetch/decode unit, so it can find instruction-level parallelism automatically. A SIMD processor has one fetch/decode unit but several ALUs, so it can run the same instruction on different data, but this is controlled by the compiler or the programmer. Am I correct?

kayvonf

@Kaharjan. Your statement is correct!
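As a rough illustration of the SIMD half of that statement, the sketch below counts how many instructions would have to be fetched and decoded to add two arrays. It is a plain Python model, not real vector code: `scalar_add`, `simd_add`, and the `lanes` parameter are hypothetical names, and each loop iteration stands in for one decoded instruction.

```python
def scalar_add(a, b):
    """One decoded instruction per element."""
    out, decoded = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        decoded += 1
    return out, decoded


def simd_add(a, b, lanes=4):
    """One decoded instruction per group of `lanes` elements.
    The same add is applied to every lane of the group."""
    out, decoded = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        decoded += 1
    return out, decoded


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8

scalar_result, scalar_decoded = scalar_add(a, b)
simd_result, simd_decoded = simd_add(a, b, lanes=4)
print(scalar_decoded, simd_decoded)
```

Both versions produce the same values. The difference is that in the SIMD version the grouping into lanes is written into the program (by the programmer or compiler), whereas a superscalar core discovers independent instructions on its own while the program runs.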

lapack

I want to make sure I understand several basic concepts correctly, including pipelining, superscalar execution, and SMT. Firstly, all of them are used to exploit ILP. Pipelining improves CPU throughput, not latency, by using different hardware units (such as IF, ID, and EX) simultaneously. Superscalar duplicates execution units and pipelines, for example two pipelines and two integer units, so it can execute two add instructions simultaneously, whereas a CPU with one pipeline and one execution unit can only execute them sequentially. SMT adds more thread contexts but not execution units; it can exploit ILP across multiple threads rather than only one thread. This can't be done by superscalar. Let's take the Pentium 4 as an example: it has two integer units and an FPU, so I guess it has two pipelines and is a superscalar CPU. It does not seem to be an SMT CPU. Am I right? Thanks!

kayvonf

@lapack: comments are below.

"all of them are used to exploit ILP"

You wrote ILP here, but you really meant parallelism in general. ILP specifically is used to mean independent instructions in a single instruction stream. Hardware multi-threading is not exploiting independent instructions in an instruction stream... instead it is exploiting the property that the program features multiple threads of execution, or in other words, multiple instruction streams. It does turn out that when processing a single instruction stream, pipelining (although we have not talked about it in class yet) and superscalar execution depend on there being ILP in the instruction stream.
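One way to put a number on "independent instructions in a single instruction stream" is to divide the instruction count by the length of the longest dependency chain. The sketch below does that under simplifying assumptions: every instruction takes one step, only read-after-write dependencies count, and each register is written once. The function `ilp` and the small program are illustrative, not taken from the slides.

```python
def ilp(instrs):
    """Average instruction-level parallelism of one instruction stream:
    instruction count divided by the longest dependency chain.
    Each instruction is a (dest, sources) tuple."""
    if not instrs:
        return 0.0
    depth = {}    # register -> length of the chain that produces it
    longest = 0
    for dest, srcs in instrs:
        # Inputs never written by this stream (x, y, z below) have depth 0.
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        longest = max(longest, d)
    return len(instrs) / longest


# x*x + y*y + z*z as five instructions in a single stream.
prog = [("a", ("x", "x")),   # a = x * x
        ("b", ("y", "y")),   # b = y * y
        ("c", ("z", "z")),   # c = z * z
        ("d", ("a", "b")),   # d = a + b
        ("e", ("d", "c"))]   # e = d + c

print(ilp(prog))
```

Here the three multiplies are independent of each other, while the two adds must wait for their inputs, so the five instructions need at least three steps. A stream where every instruction depends on the previous one would have an ILP of 1, and extra execution units could not speed it up.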

Your comments on pipelining the execution of instructions are correct. However, we have not discussed this in class so far. (We've only assumed that instructions have some latency.)

To implement superscalar execution, a processor must be able to execute multiple instructions per clock (it needs more execution units) and to fetch/decode more than one instruction per clock to feed these execution units. So I agree with you when you wrote "superscalar duplicates execution units and pipelines".

SMT: When you wrote SMT (simultaneous multi-threading) you actually meant interleaved [hardware] multi-threading. Please see my detailed description here and these review slides here. However, if you had written interleaved multi-threading, I would have agreed completely with what you wrote.

"it can exploit ILP across multiple threads rather than only one thread"

Using the term ILP is not correct here. Multi-threading takes advantage of the fact that threads, by definition, are different instruction streams. A chip that runs more than one thread concurrently is exploiting thread-level (coarse) parallelism in the application, not instruction-level parallelism within one thread.
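A toy model of interleaved hardware multi-threading may help separate the two ideas. It assumes a core with a single execution unit that issues at most one instruction per clock, chosen round-robin from whichever thread contexts are ready, and it assumes a thread cannot issue its next instruction until the previous one has finished. The function `clocks_needed` and the latency values are hypothetical.

```python
def clocks_needed(threads):
    """Clocks until every instruction has completed on a core with one
    execution unit and one execution context per entry in `threads`.
    Each thread is a list of instruction latencies in clocks."""
    n = len(threads)
    pc = [0] * n          # next instruction index for each context
    ready_at = [0] * n    # clock at which each context may issue again
    clock = 0
    last = -1             # context that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        for k in range(1, n + 1):
            t = (last + k) % n  # round-robin, starting after `last`
            if pc[t] < len(threads[t]) and ready_at[t] <= clock:
                ready_at[t] = clock + threads[t][pc[t]]
                pc[t] += 1
                last = t
                break  # only one instruction issues per clock
        clock += 1
    return max([clock] + ready_at)


# Alternating short (1 clock) and long (3 clock) instructions.
stream = [1, 3, 1, 3]

one_thread = clocks_needed([stream])
two_threads = clocks_needed([stream, stream])
print(one_thread, two_threads)
```

Within each stream every instruction waits for the one before it, so there is no ILP to exploit. The second context still helps: while one thread is stalled on a long instruction, the core issues from the other, and two threads finish in fewer clocks than running the same stream twice back to back. No execution units were added; only a second instruction stream.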

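Finally, you are correct that the original Pentium 4 supported only one execution context and therefore did not support hardware multi-threading. (Later Pentium 4 models added Hyper-Threading, Intel's implementation of SMT, with two execution contexts per core.)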