Previous | Next --- Slide 5 of 46
Back to Lecture Thumbnails
smklein

Can someone describe why this kernel code deadlocks?

eatnow

If a group of threads have to run to completion before the next group begins, then there will be a deadlock, because the first group will never complete due to __syncthreads() not returning. __syncthreads() will not return because the other threads will never get to run.

kayvonf

Specifically, there is no preemption of CUDA threads. Once a CUDA thread is assigned an execution slot on the hardware, there is no mechanism to relinquish the slot until the thread terminates.

kkz

@smklein I think it's not necessarily that the code produces a logical deadlock, but that certain implementations will result in a deadlock (if the threads are not processed concurrently).

analysiser

I think the deadlock hazard here is if the first group of threads that parallel executed in kernel does not have the threadIdx.x smaller than two (which is weird to me), then all threads are waiting for the assignment of support array for threadIdx.x < 2 case, which will never happen unless this group of threads finished executing. That's why there is a dependency between threads in a block.

analysiser

But then I have a question about: even though the threads are divided to two groups to execute, isn't that for each of the group there would be at least a thread whose idx.x must be less than 2? Therefore there should not be a deadlock at all. I would feel if the if (threadidx.x < 2) is changed to (index < 2), there would be a potential deadlock due to thread dependencies...

Even, I am not too clear as to why the above code would deadlock. This is because the only assignments happening before the __syncthreads() are using the input array which is never modified by any other thread block so I am not able to exactly figure out why there would be a dependency leading to a deadlock.
I don't think this code deadlocks - the code here illustrates the case where there is no deadlock because __syncthreads() is between threads in the same block, and all the threads in a block run concurrently. So, each thread in the block has the opportunity to run and reach the barrier.