Slide 31 of 54
Mayank

Question: If we wish to achieve barrier synchronization between different blocks, do we have to break the work into separate kernel functions?

briandecost

@Mayank -- I'd say yes: since there's no guarantee about the order in which kernel blocks are run, it doesn't make much sense to have one block stop in the middle of a kernel function and wait for another block (or all other blocks) to catch up.

Edit: after some searching around, I found this paper that demonstrates that it's possible to synchronize between blocks in a way that boosts performance, at least for some problems. I don't think there's an official way to synchronize between blocks other than using multiple kernels (remember that GPUs were designed for handling independent data), and it's probably still not a good idea unless you really know what you're doing.
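
For concreteness, here is a minimal sketch of the "multiple kernels" approach (the kernel names and sizes below are made up for illustration). Because kernel launches issued to the same stream execute in order, the boundary between the two launches acts as a global barrier: every block of phase1 finishes before any block of phase2 starts.

```
// Phase 1: each block produces its share of intermediate results.
__global__ void phase1(float* data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        data[i] *= 2.0f;
}

// Phase 2: safe to read results produced by *any* block in phase 1.
__global__ void phase2(const float* data, float* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = data[i] + data[N - 1 - i];
}

// Host side: the back-to-back launches provide the inter-block barrier.
// phase1<<<numBlocks, threadsPerBlock>>>(d_data, N);
// phase2<<<numBlocks, threadsPerBlock>>>(d_data, d_out, N);
```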

kayvonf

Keep in mind the following statement in Section 5 of the paper referenced by @briandecost:

"Our solution to this problem is to have an one-to-one mapping between thread blocks and SMs. In other words, for a GPU with "Y" SMs, we ensure that at most "Y" blocks are used in the kernel. In addition, we allocate all available shared memory on a SM to each block so that no two blocks can be scheduled to the same SM because of the memory constraint.

This is exactly what I describe on slide 49. Since CUDA gives you global synchronization primitives (atomic operations on global memory) that work across blocks, it is possible to use those primitives to build inter-block communication, as I do on slides 47 and 48.

However, we must stay mindful of the difference between CUDA's semantics and its implementation on GPUs. It's not a question of official vs. unofficial; it's simply a question of whether the programmer is willing to make assumptions about the implementation of the GPU's block scheduler. In the paper @briandecost cites, as well as in the example on slide 49, the program is written assuming that if N thread blocks are launched on a GPU with N cores, the GPU will run all N blocks simultaneously, one per core. This is an assumption about CUDA's implementation. It happens to be a correct assumption for current NVIDIA GPUs, but CUDA does not specify this behavior. If NVIDIA ever wanted to change this implementation in future GPUs and, say, use only half the GPU's cores to run the program, they would certainly be allowed to do so (the implementation would still be valid according to CUDA's semantics). However, this implementation change would cause these programs to deadlock.
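
To make this concrete, here is a rough sketch (not code from the paper or the slides) of the kind of inter-block barrier being described, built from a global atomic counter. It is a one-shot barrier, and it is only correct under the implementation assumption above: if the scheduler runs blocks in waves rather than keeping all of them resident, the resident blocks spin forever waiting for blocks that never start, and the program deadlocks.

```
// Count of blocks that have reached the barrier (assumed initialized to 0).
__device__ volatile int g_arrived = 0;

__device__ void global_barrier(int numBlocks) {
    __syncthreads();                      // every thread in this block has arrived
    if (threadIdx.x == 0) {
        __threadfence();                  // make this block's writes visible to other blocks
        atomicAdd((int*)&g_arrived, 1);   // announce this block's arrival
        while (g_arrived < numBlocks)     // spin until all launched blocks have arrived;
            ;                             // deadlocks if some blocks are never scheduled
    }
    __syncthreads();                      // release the remaining threads in this block
}
```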

In spirit, a CUDA program defines a large stream of thread blocks, and those blocks are dynamically assigned to cores. Under this model, it makes sense to think of the work done by a thread block as a sub-problem, and it's advisable for the programmer to break the problem into many more sub-problems than there are cores in order to achieve an even distribution of work onto the cores. If a programmer instead chooses to create exactly as many thread blocks as cores, the programmer is no longer thinking about splitting up work into sub-problems; he/she is thinking about spawning a bunch of endlessly running low-level workers. The program takes ownership of manually assigning work to these workers (using global synchronization primitives to facilitate coordination). Essentially, this treats a thread block very much like a pthread.
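
Here is a hedged sketch of that "persistent worker" style (the names are illustrative, not from the slides): launch exactly as many blocks as the GPU has SMs, and have each block repeatedly pull the next chunk of work from a shared counter using a global atomic, instead of relying on the hardware block scheduler to hand out work.

```
// Global work counter shared by all blocks (reset to 0 before each launch).
__device__ int g_nextTask = 0;

__global__ void persistent_worker(float* data, int N, int numTasks) {
    __shared__ int task;                      // the task this block is currently working on
    while (true) {
        if (threadIdx.x == 0)
            task = atomicAdd(&g_nextTask, 1); // thread 0 grabs the next task id
        __syncthreads();                      // everyone sees the new task id
        if (task >= numTasks)                 // no work left: this worker retires
            break;
        int i = task * blockDim.x + threadIdx.x;
        if (i < N)
            data[i] *= 2.0f;                  // process this task's chunk of the data
        __syncthreads();                      // finish before thread 0 overwrites 'task'
    }
}

// Launched with exactly as many blocks as the GPU has SMs, e.g.:
// persistent_worker<<<numSMs, 128>>>(d_data, N, (N + 127) / 128);
```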