Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Perf Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 60 of 70

xiaoguaz

Here I suddenly have a question about CUDA. I think the CUDA thread is non-preemptive. If there are not enough resources to run all the CUDA threads together, how can they realize the __syncthreads()?

kayvonf

@xiaoguaz: Please see the GPU programming lecture, slide 56 and slide 73.

xiaoguaz

Thank you, prof. Kayvonf. After referring the discussion in former slides, I find that one block in GPU is running under groups of 32-wide SIMD. For GTX980, there can be 1024 threads running within one block at the same time. So these threads on the same block can be synchronized. Threads on different blocks cannot be synchronized and may not be running in parallel.

kayvonf

@xaioguaz: To state things very clearly. The threads in CUDA thread block have the ability to synchronize via __snycthreads and to communicate via per-bloc shared memory. Threads in a thread block are executed on the GPU in groups of 32 threads called warps. All the threads in a warp share an instruction stream, and a warp is executed using SIMD processing.

CUDA limits the maximum number of threads in a thread block to 1024 (32 warps). However a GTX980 core is capable of maintaining state for up to 2048 threads (64 warps) per core. As a result, it is possible to maintain 2 blocks of 1024 threads on a GTX980 core at the same time. The core will interleave execution of all of these threads.