kayvonf

Question 1: Explain the statement "work assignment to blocks is implemented entirely by the application". In more traditional CUDA programming, what entity is responsible for assigning CUDA thread blocks to GPU resources?

Question 2: Explain the statement "now the programmer's mental model is that all CUDA threads are concurrently running on the GPU at once".

paracon

Answer 1: In the traditional CUDA programming model, the application typically launches many more thread blocks than the GPU can accommodate at once, and work assignment is handled by the hardware: the GPU's block scheduler assigns thread blocks to SMs as resources free up. On this slide, the application launches only as many thread blocks as are needed to saturate the chip (each SM in a GTX 1080 can hold a maximum of 2048 active threads and 16 active blocks; there are 20 such SMs). The first thread of each block then atomically increments a global variable, workCounter, by the number of threads in the block (one element per thread), so the blocks cooperatively pull chunks of work off a software-managed queue. This work assignment is implemented by the programmer and bypasses the hardware block scheduler. The blocks keep running (hence "persistent") and terminate only when all N elements have been processed, i.e., when the software queue is empty.
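A minimal sketch of the persistent-threads pattern described above, assuming a 1D problem, 128-thread blocks, and a trivial per-element computation (the kernel name and the placeholder work are illustrative; only workCounter is taken from the slide's description):

```cuda
#define THREADS_PER_BLK 128

// Global work-queue head, shared by all thread blocks on the chip.
// Must be reset to 0 before each kernel launch.
__device__ int workCounter = 0;

// Each block loops, claiming the next chunk of THREADS_PER_BLK elements
// from the software-managed queue until all N elements have been processed.
__global__ void persistentKernel(int N, const float* input, float* output) {
    __shared__ int startingIndex;   // index of the chunk claimed by this block

    while (true) {
        // Only thread 0 of the block touches the global counter.
        if (threadIdx.x == 0)
            startingIndex = atomicAdd(&workCounter, THREADS_PER_BLK);
        __syncthreads();            // make startingIndex visible to the whole block

        if (startingIndex >= N)     // queue is empty: the persistent block exits
            break;

        int idx = startingIndex + threadIdx.x;
        if (idx < N)
            output[idx] = 2.0f * input[idx];   // placeholder per-element work

        __syncthreads();            // don't let thread 0 overwrite startingIndex
                                    // while other threads may still be reading it
    }
}
```

The kernel would be launched with just enough blocks to fill the chip (see the sizing sketch later in this thread), rather than with one block per chunk of input.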

Answer 2: Because the application launches only as many thread blocks as the GTX 1080 can keep resident at once, every CUDA thread that is created is actually resident on the chip for the entire kernel. The programmer can therefore reason as if all CUDA threads are running concurrently on the GPU, which is not true in the traditional model, where most launched blocks are waiting to be scheduled.

mattlkf

If one is using the "persistent thread" programming style, what is the benefit of requesting many more blocks than the number of SMs on the GPU?

Clearly it would not be optimal to request blocks and SMs in a 1:1 ratio (i.e., 20 blocks, one per SM): a single block can use at most 48 KB of shared memory (see http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities), so one block per SM could not make full use of the 96 KB of shared memory available on each SM of a GPU such as the GTX 1080.

Also (although I'm not sure about this one), it's possible that scheduling multiple blocks per SM allows the SM to hide the latency of memory accesses by interleaving the execution of different blocks. Is this true?

kayvonf

@mattlkf. Your intuition is correct. Latency hiding. (The programmer may wish to maximize the number of threads used in order to gain the benefits of hardware multi-threading.)
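To make the latency-hiding point concrete, here is a rough sketch of how the launch could be sized so that every SM is filled with resident warps, using the GTX 1080 numbers quoted earlier in the thread (the 128-thread block size is an illustrative assumption):

```cuda
// Sizing sketch for a 20-SM GPU like the GTX 1080 (numbers are illustrative).
// Each SM supports up to 2048 resident threads, so with 128-thread blocks a
// full SM holds 2048 / 128 = 16 resident blocks, and the whole chip is
// saturated by 20 * 16 = 320 blocks -- many more than 20, but no more than
// the hardware can keep resident at once.
#define THREADS_PER_BLK     128
#define SMS_ON_CHIP         20
#define MAX_THREADS_PER_SM  2048
#define BLOCKS_PER_CHIP     (SMS_ON_CHIP * (MAX_THREADS_PER_SM / THREADS_PER_BLK))  // 320

// persistentKernel<<<BLOCKS_PER_CHIP, THREADS_PER_BLK>>>(N, input, output);
```

With 16 resident blocks per SM, each SM has many warps to interleave, so it can switch to another warp whenever one stalls on a memory access.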