I wanted to make sure my understanding of SMT is accurate. For SMT to be supported, the processor should be necessarily superscalar with multiple execution contexts, since we need multiple threads to run at the same time. Is this reasoning sound?
@paracon yes, but to be more precise I'd change the last part of your sentence: we need to run multiple instructions from different threads at the same time (rather than just threads).
Hopefully the following is a helpful summary:
Simultaneous multi-threading involves executing instructions from two different threads in parallel on a core. In class, I mainly described interleaved multi-threading, where each clock the core chooses one runnable thread and executes the next instruction in that thread's instruction stream using the core's execution resources. However I do provide an illustration of simultaneous multi-threading on slide 84.
To fast-forward a bit in the lecture, a more modern NVIDIA GPU, the GTX 1080 (see slide 61) is able to maintain state for up to 64 execution contexts (called "warps" in NVIDIA-speak) on its cores, and each clock it chooses up to four of those 64 threads to execute instructions from. Those four threads execute simultaneously on the core using four different sets of execution resources. So there is interleaved multi-threading in that the chip interleaves up to 64 execution contexts, and simultaneous multi-threading in that it chooses up to four of those contexts to run each clock. (And if you look carefully at the slide I linked to, there is also super-scalar execution in that the core will try and run up to two independent instructions for each of those four warps -- up to a total of eight overall -- each clock. For those interested in more detail: the other instruction needs to be a non arithmetic instruction, I didn't show load/store units in the figure.)
Intel's Hyper-threading implementation makes sense if you consider the context: Intel had spent years building superscalar processors that could perform a number of different instructions per clock (within a single instruction stream). But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core (this is the case of insufficient ILP). Therefore, it's a logical step to say, hey, to increase the CPU's chance of finding the right mix, let's modify our processor to have two threads available to choose instructions from instead of one!
Of course, running two threads is not always better than one, since these threads might thrash each other's data in the cache resulting in more cache misses that ultimately cause far more stalls than Hyper-Threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into cache. Then, if X is accessed by the other thread for the first time, what normally would have been a cold miss in a single thread system turns out to be a cache hit!
So to summarize: Intel processors that support Hyper-threading maintain two execution contexts (hardware threads) on chip at once. Each clock, the chip looks at the two available contexts, and tries to find a mixture of runnable instructions that best utilizes all the execution units the core has available. In rare cases, one thread might sufficient ILP to fill to consume the whole capability of the chip, and if so, the chip may just run instructions from that one thread. In this case, the chip is essentially behaving like a processor performing interleaved multi-threading.
Finally, it might also be instructive for students to note that the motivation for adding multi-threading in an Intel CPU (called hyper-threading) was different from the motivation for large-scale multi-threading in a GPU. GPUs feature many execution contexts for the primary purpose of hiding memory latency. Intel HyperThreading isn't really intended to hide all memory latency (it only has two threads, and that's not enough to hide the long latencies of out to memory). Instead, Intel HyperThreading exists to make it easier for the core's scheduler to find enough independent instructions to fill the multiple ALUs in a modern superscalar Intel CPU. In other words, the thinking was: if there is insufficient ILP in one thread to occupy all the ALUs, why not keep two threads around to draw instructions from. Of course, multi-threading does hide some latency when one thread stalls on memory access, but unlike GPUs, CPUs multi-threading is not intended to hide a significant fraction of memory latency.
I have a few questions based on the above explanation:
0) Do both SMT and Super scalar processors have the same amount of computation resources? For example, will a 2 thread hyperthreaded processor and a superscalar processor that has 2 fetch/decode units also have the same number of ALUs?
1) So, if the above assumption is right, then does it mean that a superscalar processor can only be exploited with ILP (single instruction stream parallelism)? And also, can a single instruction stream with good ILP not exploit a hyperthreaded core when compared to a non-hyperthreaded core?
2) Doesn't this also mean that SMT does not in any way help reduce memory latency?
@par. Keep in mind that as stated above Hyperthreading is a specific implementation of hardware-multithreading by Intel.
Absolutely. Intel CPUs feature multiple execution units and the point of hyperthreading is to make two threads worth of instructions (instead of one ) available to core's instruction scheduler to give it a higher chance of finding enough instructions to dispatch to all these units. In other words, there's not enough ILP in one thread to occupy a bunch of ALUs, why not give the scheduler two threads to look at to find the instructions?
No. By the definition of multi-threading and ILP. (This would be a good place to stop and go review those definitions.) Multi-threading is specifically the ability for a processor to maintain multiple execution contexts (state for multiple instruction streams) on the core at once. A single instruction stream with run within one execution context. In other words, a single instruction stream is exactly that --- one thread of control.
@kayvonf Okay, thanks for that clarification! But how does SMT help hide memory latency (mentioned on next slide.)
Suppose you have 2 threads with one thread memory access intensive and another with no memory accesses (very arithmetic intensive.) If I run these two threads on a machine with just 1 Fetch decode, 1 ALU and 2 execution contexts, the utilization will be better than if I was to run this on an SMT machine right?
In the SMT case for this workload, 1 set of fetch decode, alu and execution context will be completely unused during the memory fetch time.
Is there an example of a workload where SMT helps reduce memory latency?