joe

This solution does not seem ideal to me. If one block gets stuck with several particularly lengthy threads, it will slow the entire program down. By having more blocks than cores, the cores that finish their blocks first can grab another block and work on it instead of waiting for the slow core to finish.

kayvonf

@joe: In the code above, the user has essentially implemented a work queue. Notice how each thread block iterates in a while(1) loop, continually grabbing the next batch of work. Overall, there are N output elements to compute, and each batch covers 128 of them. Once the threads in the thread block have cooperatively finished processing their batch of 128 elements, the block grabs a new batch. When the global counter workCounter reaches N, all work is done and the threads in the thread block terminate.
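Here is a minimal sketch of that pattern (assuming the slide's names convolve, workCounter, N, and THREADS_PER_BLK; I've used the CUDA built-in atomicAdd, which has the fetch-and-add semantics the slide's atomicInc pseudocode describes):

#define THREADS_PER_BLK 128

__device__ int workCounter = 0;  // global work-queue head

__global__ void convolve(int N, float* input, float* output) {
    __shared__ int startingIndex;
    __shared__ float support[THREADS_PER_BLK + 2];

    while (1) {
        // thread 0 claims the next batch of 128 output elements
        if (threadIdx.x == 0)
            startingIndex = atomicAdd(&workCounter, THREADS_PER_BLK);
        __syncthreads();

        if (startingIndex >= N)
            break;  // all batches claimed; every thread in the block exits

        // cooperatively load this batch's input, plus a two-element halo
        int index = startingIndex + threadIdx.x;
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2)
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
        __syncthreads();

        // 3-wide box filter for this thread's output element
        float result = 0.0f;
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result / 3.f;

        __syncthreads();  // don't reload support while others still read it
    }
}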

The difference between this implementation and the one shown on slide 33 hinges on who is responsible for the assignment of work to threads. Here, the programmer has launched exactly as many blocks as can run concurrently on the GPU, and assignment of work to these blocks is carried out manually through synchronized access to (and modification of) the workCounter variable. As a result, this code makes very strong assumptions about the GPU it runs on and about how the CUDA implementation schedules thread blocks onto GPU cores. In contrast, on slide 33 the programmer created many more blocks than could fit on the machine simultaneously and relied on the CUDA system to make the assignment of blocks to cores.

A similar question about who takes responsibility for assignment (the programmer or the system) was asked in the context of ISPC programming in the Parallel Programming Basics lecture, on slides 2 and 3.
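To make the contrast in launch configuration concrete, here is a sketch of the two host-side launches (names follow the slides; BLOCKS_PER_CHIP is the hand-tuned constant discussed just below):

// slide 33 style: one block per 128-element batch; the CUDA scheduler
// maps the (possibly thousands of) blocks onto cores over time
convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, input, output);

// this slide's style: exactly as many blocks as the GPU can run at once;
// the kernel assigns batches to blocks itself via workCounter
#define BLOCKS_PER_CHIP (15 * 12)   // hand-tuned for a GTX 480
convolve<<<BLOCKS_PER_CHIP, THREADS_PER_BLK>>>(N, input, output);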

martin

Can someone clarify for me why BLOCKS_PER_CHIP for the GTX 480 GPU is ideally 15*12 here?

kayvonf

@martin: The GTX 480 GPU has 15 SMs (cores), and each SM can interleave execution of up to 1536 CUDA threads (48 warps). Since the thread block size of the convolve program is 128 CUDA threads (mapping to 4 warps), 12 blocks per core can run at a time. (This is because the core can interleave up to a total of 1536 CUDA threads, and 128 * 12 = 1536.) Therefore, at most 15 * 12 CUDA thread blocks can be executed by the GPU concurrently.

The program above is written so that each thread block executes for the entire duration of the program. To achieve this behavior, the code makes specific assumptions about the number of cores in the GPU, the capabilities of those cores, and how thread blocks are scheduled onto the GPU. This program will only work as desired on a GTX 480 GPU. This is in contrast to most CUDA programs which are oblivious to most GPU implementation details: a more conventional way to write CUDA code is to create many more thread blocks than can be executed at once by the GPU and rely on the GPU scheduler to dynamically assign blocks to cores over time.
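If one wanted to avoid hard-coding 15 * 12, a rough sketch is to derive the figure from device properties at runtime (this still ignores per-SM limits on resident blocks, registers, and shared memory, so it is only an approximation):

#include <cuda_runtime.h>

int blocksPerChip(int threadsPerBlock) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // GTX 480: multiProcessorCount = 15, maxThreadsPerMultiProcessor = 1536,
    // so 1536 / 128 = 12 blocks per SM and 15 * 12 = 180 blocks total
    int blocksPerSM = prop.maxThreadsPerMultiProcessor / threadsPerBlock;
    return prop.multiProcessorCount * blocksPerSM;
}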

gbarboza

Can you elaborate on the statement that "since the thread block size of the convolve program is ..."? It's unclear to me why this allows the device to run 12 blocks per core. I thought that a core could only run one block at a time.

sfackler

Threads access workCounter via atomicInc:

int atomicInc(int *val, int amount)

where

x = atomicInc(&foo, y);

is conceptually equivalent to

atomically {
    x = foo;
    foo += y;
}

Here only one thread (thread 0) in each thread block calls atomicInc and increments the counter by the number of threads in the block. (Note that the signature above is the slide's pseudocode: the CUDA built-in with these fetch-and-add semantics is actually atomicAdd, while the built-in named atomicInc has different, wrapping semantics.)

Aside: It's interesting to think about what happens if each thread calls atomicInc, so that each thread increments workCounter and gets a unique value from it. Remember that every thread in a warp executes each instruction at the same time. The CUDA spec says that when a warp calls atomicInc, the order in which the threads "appear" to have called it is undefined. That is, if workCounter is 0 and some warp calls x = atomicInc(&workCounter, 1), each thread in that warp will have a different value in x, ranging from 0 to 31, but we can't know ahead of time which thread has which value.
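A sketch of that per-thread variant (hypothetical; this is not the slide's code), where every thread claims its own element:

__device__ int workCounter = 0;

__global__ void claimPerThread(int N, float* input, float* output) {
    while (1) {
        // every thread does its own fetch-and-add; within a warp the 32
        // returned values are unique, but which thread gets which value
        // is unspecified
        int index = atomicAdd(&workCounter, 1);
        if (index >= N)
            break;
        output[index] = input[index];  // stand-in for real per-element work
    }
}

Because no __syncthreads() is involved, it is fine for threads of the same block to exit the loop at different times here.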

kayvonf

@gbarboza: Multiple CUDA thread blocks can be scheduled onto a single GPU core and executed concurrently, provided sufficient resources (e.g., thread execution contexts and shared memory space) exist on the core. A GTX 480 SM core can run up to 12 CUDA thread blocks at once, provided the total number of threads summed over all blocks on the core does not exceed 1536.
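On newer CUDA versions you can also ask the runtime directly how many blocks of a given kernel fit on one SM, rather than doing the arithmetic by hand (a sketch, using the slide's convolve and THREADS_PER_BLK):

int numBlocks;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, convolve, THREADS_PER_BLK, 0 /* dynamic shared mem */);
// numBlocks accounts for the per-SM thread, register, and shared-memory
// limits, not just the 1536-thread bound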

sfackler

There's some info on how CUDA allocates thread blocks here.