kayvonf

Question: We've talked about instruction-level parallelism, simultaneous hardware multi-threading, interleaved hardware multi-threading, and SIMD execution so far in the course. Examples of each of these techniques are present in how a CUDA program executes on a GTX 680 SMX core. Anyone care to explain how all these concepts pop up in the execution of a CUDA program on a GTX 680?

Xiao

From the GPU hardware's perspective:

  • Cores running different warps in parallel provide simultaneous multithreading.
  • Each core selecting and running runnable warps provides interleaved multithreading.
  • The threads in each warp run simultaneously on multiple SIMD lanes (a small sketch of this mapping follows the list).
  • Multiple fetch/decode units per core take advantage of ILP and try to issue as many instructions in parallel as possible.
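A minimal sketch of the thread-to-warp/lane mapping (my own example, not from the lecture; the kernel name show_mapping and the 128-thread block size are just assumptions):

    // Each CUDA thread records which warp and which SIMD lane it executes on.
    // A warp size of 32 is assumed, which holds for the Kepler SMX discussed here.
    __global__ void show_mapping(int *warp_id, int *lane_id) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        warp_id[tid] = threadIdx.x / warpSize;            // warp within the block
        lane_id[tid] = threadIdx.x % warpSize;            // SIMD lane within the warp
    }

    // Example launch: 128 threads per block = 4 warps per block, so each core's
    // warp schedulers have several runnable warps to select among each clock.
    // show_mapping<<<num_blocks, 128>>>(d_warp_id, d_lane_id);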

From the CUDA program's perspective:

  • Multiple CUDA kernels allocated to different blocks could be running on different cores concurrently, or interleaved on the same core (both types of multithreading). Different CUDA kernels could also be run under this scheme.
  • Multiple CUDA kernels allocated to the same block but to different thread indices will run concurrently in one warp on different SIMD lanes.
  • ILP is "invisible" to the CUDA programmer, but code with good ILP will certainly run faster (see the sketch after this list).
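To make the ILP point concrete, here is a rough sketch (the kernel saxpy2 is my own illustration, not course code): the two multiply-adds below have no dependence on each other, so hardware that can issue multiple independent instructions per clock from one warp may overlap them; chaining the second onto the first would remove that opportunity.

    __global__ void saxpy2(float a, const float *x, const float *y,
                           float *out0, float *out1, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float r0 = a * x[i] + y[i];   // independent instruction stream 0
            float r1 = a * y[i] + x[i];   // independent instruction stream 1 (does not use r0)
            out0[i] = r0;
            out1[i] = r1;
        }
    }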
kayvonf

@Xiao: Awesome! Here are a few corrections/clarifications though:

  • There's also simultaneous multi-threading employed within a single SMX core by selecting up to four warps to run in a clock. Those four warps are four independent instruction streams running simultaneously.
  • The multiple fetch/decode blocks serve to exploit thread-level parallelism (up to four warps per clock, as stated above) and also instruction-level parallelism (up to two independent instructions per clock per warp).
  • For clarity let's use the term "CUDA thread" instead of "kernel". A CUDA kernel launch is a call of a device function that corresponds to a logical launch of many CUDA threads. The threads are organized into thread blocks. The thread-block abstraction is a strong locality hint to the GPU. The programmer is hinting to the system that threads in a block are likely to cooperate. GPU implementations use this hint as a signal that it is a good idea to co-locate the CUDA threads in a thread block on the same SMX core, enabling faster communication and synchronization (a small sketch of such block-level cooperation is below).
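As a rough illustration of why that co-location matters (a sketch only; block_sum, the 128-thread block size, and the assumption that the input length is a multiple of 128 are mine, not from the lecture): threads in one block cooperate through on-chip shared memory and synchronize with __syncthreads(), which is fast precisely because the whole block lives on a single SMX core.

    #define BLOCK_SIZE 128

    __global__ void block_sum(const float *in, float *block_sums) {
        __shared__ float partial[BLOCK_SIZE];             // on-chip, per-block storage
        int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        partial[threadIdx.x] = in[i];
        __syncthreads();                                  // all loads visible to the block

        // Tree reduction within the block, entirely in shared memory.
        for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = partial[0];          // one partial sum per block
    }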