Slide 73 of 79

Can a block be divided and executed on different cores when there are no thread dependencies inside the block? If so, when there is not enough space to accommodate a block on a single core, will the scheduler divide the block into several parts to fill idle warp slots across different cores?


In general, if there are too many CUDA threads to fit in a GPU core's warps, is it possible to run the program when there are no dependency issues between threads (i.e., the __syncthreads() deadlock problem shown on the slide), or is the scheduler not smart enough to cycle a block's threads through the warps as necessary?


I don't believe thread blocks can be split into mini-blocks, because independence of the threads within a block cannot be guaranteed. Also, on the next slide: "threads in thread-block are concurrent, cooperating 'workers.'"


In this picture, if there were 8 warps instead of 4 and the number of simultaneous instruction streams remained the same (4), would we be allowed to execute __syncthreads() for a block size of 256 threads? In other words, do all threads of a block need to run simultaneously, or is it OK as long as there are enough warps to store their execution contexts? In this case only up to 128 threads can run concurrently.


Does that mean this code will not compile, because it needs to run 256 CUDA threads concurrently?


Why do we need 8 warps to run 256 CUDA threads? Since there are 256 threads per block, shouldn't we just need 1 block? Where are you getting the 8 warps from?


@maxdecmeridius. A CUDA thread block and a warp are very different concepts. This code definitely specifies that a thread block contains 256 CUDA threads. And since a warp is a group of 32 threads, the threads in the block definitely map to eight warps (256 / 32 = 8) when executing on a GPU. You may find slide 56 and slide 71 in this lecture very helpful.


From this slide, a block puts its shared data in its 'context', which I think corresponds to a warp in this graph. But since all threads in a block share this data, does that mean all warps belonging to this block would have the same copy of the data, or is this data actually placed somewhere else?


About the "shared memory": is it just `support` in the convolve code? Also, in this example, blocks can be 'disassembled' into warps. However, is it true that, in reality, one block always runs on one core?


Does this imply that the programmer must ensure that

num_threads_per_block <= warp_size * num_warps_per_core

and would we get a compilation error otherwise?


@muscleNerd. You are correct.