afa4

Is there an explicit synchronization construct (like __syncthreads()) which waits for all threads in all blocks?

Update: slide 40 says that CUDA assumes that thread block execution can be carried out in any order. So I am guessing such a construct is not required.

arjunh

@afa4 cudaThreadSynchronize is probably what you're looking for; it blocks the host until the entire kernel (which comprises all of its thread blocks) has finished executing before another kernel is launched.
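
For anyone looking for the exact call: in newer CUDA releases cudaThreadSynchronize() has been deprecated in favor of cudaDeviceSynchronize(), which does the same thing. A minimal host-side sketch of how it is typically used (the scale kernel here is made up purely for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only for illustration (name and body are hypothetical).
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    // Launch many thread blocks; the hardware may run them in any order.
    scale<<<(N + 255) / 256, 256>>>(d_data, 2.0f, N);

    // Host-side synchronization: blocks the CPU until *all* thread blocks of
    // the kernel above have finished. cudaThreadSynchronize() is the older,
    // now-deprecated name for the same operation.
    cudaDeviceSynchronize();

    printf("kernel finished\n");
    cudaFree(d_data);
    return 0;
}
```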

kayvonf

@afa4: You bring up a good question. A barrier that worked across all thread blocks in a single launch would be very inefficient to implement and would in fact violate the CUDA programming model. I suspect you would never see its inclusion in CUDA unless the language changed dramatically.
Question: Why is this the case?

afa4

I think it would be very inefficient to implement a barrier that worked across all thread blocks in a single launch because blocks are non-preemptible. Once a block reaches the barrier, it has to wait for every other block to reach it as well, and while it waits it continues to occupy its core, so no other block can be scheduled there.

kayvonf

Great! And if the implementation has no ability to preempt thread blocks, the global barrier is not merely inefficient, its use causes... ??????

afa4

I guess it could cause a deadlock if the number of thread blocks is more than the number of cores?
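
For example, a naive grid-wide barrier built out of atomics (just a sketch; the kernel name and the arrived counter are made up) hangs whenever the launch has more thread blocks than can be resident on the GPU at once: the resident blocks spin at the barrier and, since blocks are never preempted, the remaining blocks can never be scheduled to check in.

```cpp
__device__ unsigned int arrived = 0;    // hypothetical check-in counter

__global__ void kernel_with_naive_global_barrier() {
    // ... phase 1 work ...

    __syncthreads();                    // all threads in this block finish phase 1
    if (threadIdx.x == 0)
        atomicAdd(&arrived, 1u);        // this block checks in at the "barrier"

    // Spin until every block in the grid has checked in.
    // If gridDim.x exceeds the number of blocks that can be resident at once,
    // this loop never exits: the blocks that would satisfy it are waiting to
    // be scheduled on cores that the spinning blocks will never release.
    while (atomicAdd(&arrived, 0u) < gridDim.x)
        ;

    // ... phase 2 work ...
}
```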

kayvonf

Correct!