Slide 64 of 65
yetianx

I did not get what is simultaneous multi-threading? Is it multiple threads on multiple cores?

arjunh

@yetianx: Simultaneous multi-threading (also known as Hyper-Threading in the Intel context) is a form of multi-threading where a single core chooses instructions from multiple threads to run on its ALUs in the same cycle. This is different from interleaved multi-threading, which is effectively what we discussed here.

Here's an example taken from Wikipedia to explain the difference:

interleaved multi-threading:

  • Cycle i: an instruction from thread A is issued
  • Cycle i+1: an instruction from thread B is issued
  • Cycle i+2: an instruction from thread C is issued

simultaneous multi-threading:

  • Cycle i: instructions j and j+1 from thread A and instruction k from thread B are all simultaneously issued
  • Cycle i+1: instruction j+2 from thread A, instruction k+1 from thread B, and instruction m from thread C are all simultaneously issued
  • Cycle i+2: instruction j+3 from thread A and instructions m+1 and m+2 from thread C are all simultaneously issued
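To make the two issue policies concrete, here's a toy Python sketch (the thread names and per-cycle picks are made up, and this is only a trace generator, not a real scheduler) that reproduces the traces above:

```python
# Toy model of the two issue policies. Each "instruction" is just a
# (thread, index) label; a trace is a list of per-cycle issue groups.

def interleaved_issue(threads, cycles):
    """Issue one instruction per cycle, round-robin across threads."""
    trace = []
    counters = {t: 0 for t in threads}
    for c in range(cycles):
        t = threads[c % len(threads)]
        trace.append([(t, counters[t])])  # exactly one instruction this cycle
        counters[t] += 1
    return trace

def simultaneous_issue(picks):
    """Issue several instructions per cycle, possibly from several threads.

    `picks` lists, for each cycle, how many instructions each thread
    contributes that cycle, e.g. [("A", 2), ("B", 1)]."""
    trace = []
    counters = {}
    for cycle_picks in picks:
        issued = []
        for t, n in cycle_picks:
            for _ in range(n):
                issued.append((t, counters.get(t, 0)))
                counters[t] = counters.get(t, 0) + 1
        trace.append(issued)
    return trace

print(interleaved_issue(["A", "B", "C"], 3))
# one instruction per cycle: A then B then C

print(simultaneous_issue([[("A", 2), ("B", 1)],
                          [("A", 1), ("B", 1), ("C", 1)],
                          [("A", 1), ("C", 2)]]))
# mirrors the Wikipedia trace: cycle i issues two A instructions and one B
# instruction together, and so on
```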
kayvonf

More generally, simultaneous multi-threading involves executing instructions from two threads in parallel on a single core. In class, I mainly described interleaved multi-threading, but notice that there is also simultaneous multi-threading being performed by the NVIDIA GPU on slide 57. Each clock, the chip selects two of the up-to-48 runnable execution contexts ("warps") and executes the next instruction in each of those streams. So there is interleaved multi-threading in that the chip interleaves up to 48 contexts, and simultaneous multi-threading in that it chooses two of those contexts to run each clock.

Intel's Hyper-Threading implementation makes sense if you consider the context: Intel had spent years building superscalar processors that could execute several different instructions per clock. But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core (the case of insufficient ILP). Therefore, it's a logical step to say, hey, to increase the CPU's chance of finding the right mix, let's have two threads available to choose instructions from instead of one!
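A back-of-the-envelope way to see why this helps (all numbers here are hypothetical, chosen only for illustration): suppose a 4-wide superscalar core, where each thread can only find two independent instructions per cycle.

```python
# Hypothetical 4-wide superscalar core; each thread supplies some number
# of independent (issuable) instructions per cycle.
ISSUE_WIDTH = 4

def slots_filled(threads_ilp):
    """Greedily fill issue slots from the available threads' independent
    instructions, capped at the core's issue width."""
    available = sum(threads_ilp)
    return min(ISSUE_WIDTH, available)

print(slots_filled([2]))     # one thread with ILP of 2: half the slots go unused
print(slots_filled([2, 2]))  # two such threads: all four slots can be filled
```

With one low-ILP thread, half the issue slots sit idle every cycle; a second thread gives the core more candidate instructions to fill them with, which is exactly the motivation described above.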

Of course, running two threads is not always better than one, since these threads might thrash each other's data in the cache, resulting in more cache misses that ultimately cause far more stalls than Hyper-Threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into the cache. Then, if X is accessed by the other thread for the first time, what normally would have been a cold miss turns out to be a cache hit!
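Both effects fall out of a toy cache model (a tiny direct-mapped cache with made-up sizes and addresses, purely for illustration):

```python
# Toy direct-mapped cache shared by two threads (sizes/addresses made up).
CACHE_LINES = 4
cache = {}  # line index -> cached address

def access(addr):
    """Return 'hit' or 'miss' for a direct-mapped lookup of `addr`."""
    line = addr % CACHE_LINES
    if cache.get(line) == addr:
        return "hit"
    cache[line] = addr  # miss: bring the address into the cache
    return "miss"

# Constructive sharing: thread A touches address 12 first...
print(access(12))  # miss (cold)
# ...so thread B's first access to 12 hits instead of cold-missing.
print(access(12))  # hit

# Thrashing: disjoint working sets that map to the same line evict each other.
print(access(4))   # thread A: miss
print(access(8))   # thread B: miss, evicts A's line
print(access(4))   # thread A again: miss (would have hit if running alone)
```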

kayvonf

Question: This is a good comment opportunity: write a definition for one of these terms!

wcrichto

Here are a couple of definitions:

  • Multi-core processor: a processor with multiple cores (lol). A core is an independent computing unit with its own set of ALUs/caches/etc.
  • SIMD execution: single instruction, multiple data; we use vector instructions to apply a single instruction across multiple data elements at once.
  • Coherent control flow: assuming this is the same as coherent execution, coherent control flow means the same instruction sequence applies to all elements operated on simultaneously
  • Interleaved multithreading: a core emulates running multiple threads by switching between their instruction streams at scheduled intervals
  • Simultaneous multithreading: actually running multiple instruction streams at the same time on one core
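For the SIMD definition in particular, here's a small sketch of the idea (real SIMD uses vector registers and hardware vector instructions; this just emulates a 4-wide vector add in plain Python to show one "instruction" covering several elements):

```python
def scalar_add(a, b):
    """One add per element: N elements cost N 'instructions'."""
    return [x + y for x, y in zip(a, b)]

def simd_add(a, b, width=4):
    """Emulated SIMD: each loop iteration stands in for ONE vector
    instruction that adds `width` element pairs at once."""
    out = []
    for i in range(0, len(a), width):
        # one vector add covers a whole group of lanes
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
    return out

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
print(simd_add(a, b))  # same result as scalar_add, but in 2 "vector ops"
```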
jinsikl

The rest of the definitions:

  • memory latency: the amount of time for a memory request to complete.
  • memory bandwidth: the number of memory requests that can complete per second; that is, the rate at which memory requests can be serviced.
  • bandwidth bound application: an application that issues memory requests at a rate higher than the system can service, so its performance is limited by memory bandwidth.
  • arithmetic intensity: roughly, how much math is done per memory request. The lower this number, the more likely the application is to be bandwidth bound. Think back to saxpy.
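The saxpy case works out nicely as a back-of-the-envelope calculation (the machine numbers below are made up for illustration, not any real processor's specs):

```python
# Arithmetic intensity of saxpy: y[i] = a * x[i] + y[i]
# Per element: 1 multiply + 1 add = 2 flops.
# Memory traffic per element: load x[i] (4 bytes) + load y[i] (4 bytes)
# + store y[i] (4 bytes) = 12 bytes.

flops_per_elem = 2
bytes_per_elem = 12
intensity = flops_per_elem / bytes_per_elem
print(f"saxpy arithmetic intensity: {intensity:.3f} flops/byte")

# Hypothetical machine (numbers made up): 1 TFLOP/s compute, 100 GB/s memory.
peak_flops = 1e12
peak_bandwidth = 100e9
balance = peak_flops / peak_bandwidth  # flops/byte needed to be compute bound
print(f"machine balance: {balance:.1f} flops/byte")
print("bandwidth bound" if intensity < balance else "compute bound")
```

Since saxpy's intensity (about 0.17 flops/byte) is far below the machine balance point, it is firmly bandwidth bound, which is why adding more ALUs wouldn't speed it up.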