Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

lament

Question: to be clear:

Interleaved multi-threading:

one thread per core at a given time, but the core holds the context for other threads that are "waiting their turn"
at a given time, the ALUs are processing the data from only one thread

SMT:

one core processes multiple threads at a given time - the ALUs are crunching data from multiple threads simultaneously

Right?

mangocourage

To add on to lament's question, when does one have benefits over the other? Also are there situations in which hyperthreading hurts performance?

BigFish

What is the difference between Superscalar and Hyper-threading? It seems that they are both techniques to allow executing several instructions on a single core. Followed is my understanding. Superscalar is an architecture that allows to dispatch different instructions to the same core and execute them parallelly making use of different functional units. Hyper-threading seems to build upon superscalar architecture, which runs several instruction streams on the same core. So the instructions fed to the core can be from different thread. Is this correct?

kayvonf

@BigFish, this is a tough question so I'll jump in.

More generally, simultaneous multi-threading involves executing instructions from two different threads in parallel on a core. In class, I mainly described interleaved multi-threading, where each clock the core chooses one runnable thread and executes the next instruction in that thread's instruction stream using the core's execution resources.

But it's also possible for the core to choose more than one thread to run per clock. To fast-forward a bit in class, a more modern NVIDIA GPU, the GTX 680 is about to maintain state for up to 64 instruction streams (called "warps" in NVIDIA-speak) on its cores, and each clock it chooses up to four of those 64 threads to execute instructions from. Those four threads executes simultaneously on the core using four different sets of execution resources. So there is interleaved multi-threading in that the chip interleaves up to 64 execution contexts, and simultaneous multi-threading in that it chooses up to four of those contexts to run each clock. (And if you look carefully at the slide I linked to, there is also super-scalar execution in that the core will try and run up to two independent instructions for each of those four warps -- up to a total of 8 overall -- each clock.)

Intel's Hyper-threading implementation makes sense if you consider the context: Intel had spent years building superscalar processors that could perform a number of different instructions per clock (within a single instruction stream). But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core (this is the case of insufficient ILP). Therefore, it's a logical step to say, hey, to increase the CPU's chance of finding the right mix, let's modiy our processor to have two threads available to choose instructions from instead of one!

Of course, running two threads is not always better than one, since these threads might thrash each other's data in the cache resulting in more cache misses that ultimately cause far more stalls than Hyper-Threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into cache. Then, if X is accessed by the other thread for the first time, what normally would have been a cold miss in a single thread system turns out to be a cache hit!

So to summarize: Intel processors that support Hyper-threading maintain two execution contexts (threads) on chip at once. Each clock, the chip looks at the two available contexts, and tries to find a mixture of runnable instructions that best utilizes all the execution units the core has available. It might be the case that one thread has sufficient ILP to fill to consume the whole capability of the chip, and if so, the chip may just run instructions from that one thread, and thus it is essentially behaving like a processor performing interleaved multi-threading.

BigFish

@kayvonf It takes time to understand, but I got the point now! Thank you so much for the fantastic illustration!

jcarchi

Does multithreading have any downsides? All the discussion in class made it seem like there were little to no disadvantages.

One thing I've noticed is that multithreading makes benchmarks slightly inconsistent. For my example, I have a program (happens to be a parallel program) and I want to know how long it takes to run on two strings of length m and n. Running the benchmark with multi threading turned on seems to give less consistent results versus multithreading turned off. Why is that?

russt17

Am I right that the OS is responsible for choosing which processes are running on each core, but a core chooses which threads of a process are running on its ALUs? The OS doesn't play a role in scheduling threads?

kayvonf

@jcarchi: See the enumeration of downsides on the next slide.

jcarchi

Ah sorry I must have missed that! thanks!

HingOn

@russt17 I believe a single-threaded operating system schedules processes (PCB: process control block), and a multi-threaded operating system schedules threads (TCB: thread control block). In other words, OS scheduler manages both processes and threads.

crispyhexagons

Maybe this was already touched upon, but then how exactly are the choices made on which process or thread to run next? And wouldn't it take extra time to make these decisions instead of just running a new process or thread?

HCN

@russt17 I agree with @HingOn . Multithreading is a rather high level concept, multi-threaded programs can still be run even on a simple processor with only one ALU as in slide 9.

HingOn

@crispyhexagons The choice of which process/thread to run next depends on how the OS scheduler is implemented. The simplest scheduler is round-robin (all runnable processes/threads are placed in a simple queue).

Here are some other examples. http://en.wikipedia.org/wiki/Scheduling_(computing)#Scheduling_disciplines

The scheduler should be implemented to return very quickly to minimize the overhead. For example, the round-robin scheduler is O(1), a quick dequeue for next running process/thread.

lament

To get back to BigFish's comment from 3 days ago, I was under the impression (which heads-over-tails is wrong) that the distinction between superscalar execution and hyper-threading is this:

Superscalar execution allows multiple instructions to be run at the same time on ONE ALU

Hyper-threading primarily allows multiple threads to run on a single core at the same time by divvying the available ALUs so that each ALU is used by NO MORE THAN ONE thread at the same time.

Could somebody confirm or deny this? Sorry if this has already been asked and answered.

ekr

In HW1 we're told that the processors we're using utilize hyper-threading, but we should basically ignore this capability. (When) will we start using hyper-threading? Will it ever be a major part of an assignment that we need to heavily utilize for performance, or just something that is optional for further improving our programs in the hopes of meeting a speedup requirement?

kayvonf

@ekr: The machines in use for assignment do indeed have hyper-threading an the effects of hyper-threading are certainly visible in the assignment. If you create more than threads on a four-core Hyper-threaded CPU, it's possible to see a speedup over more than 4X vs. a baseline sequential program. Why?

yuel1

@kayvonf: is this because with hyper-threading, there could potentially be 8 threads running at once?

ananyak

@russt17 I'm guessing that the distinct threads that are being hyper-threaded don't actually have to be from the same process.

Can a user control hyper-threading settings in the CPU (for example which threads should be hyper-threaded together)?

kayvonf

@yuel1: That's true. There can be 8 threads running concurrently on a hyper-threaded quad-core processor at once, but why would that help? There's still the same number of execution units as a non-hyper-threaded processor!

rramo

@kayvonf I know this discussion was a while ago, but I wanted to try to answer your question about why having 8 threads helps despite having the same number of execution units as a non-hyper-threaded processor. Is it because now there's more instruction streams and thus a better chance of finding independent instructions to utilize all the available units? That is, we can find more ILP to exploit?

Sherry

@rramo I think you are right.