Slide 48 of 79
kayvonf

Question: Why do you think __syncthreads is only a barrier across threads in a thread block?

bojianh

I do not quite get this question. What does it mean that __syncthreads is only a barrier? Does it mean __syncthreads is capable of acting only as a barrier?

Anyway, __syncthreads definitely prevents threads from proceeding (and thus from reading or writing past the barrier) until all the threads have arrived at that instruction.

kayvonf

A __syncthreads operation is a barrier across all threads in a single thread block. It is not a barrier across all threads created in a kernel launch. Why do you think that's the case?
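To make the block-scoped semantics concrete, here is a minimal reduction sketch (kernel and buffer names are made up for illustration): every __syncthreads() call synchronizes only the threads of one block, which cooperate through that block's shared memory.

```cuda
// Per-block sum reduction. __syncthreads() is a barrier across the
// threads of THIS block only; other blocks are unaffected.
__global__ void block_sum(const float* in, float* block_sums) {
    __shared__ float buf[256];               // one buffer per thread block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                         // wait for all loads in this block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                     // again: only this block's threads
    }

    if (tid == 0)
        block_sums[blockIdx.x] = buf[0];
}
```

Note that because no barrier spans blocks, combining the per-block partial sums requires a second kernel launch (or atomics), not another __syncthreads.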

bojianh

Oh, I see what you mean. The execution of the thread blocks is independent of each other; hence, they can be executed either on the same core at different times or on different cores at the same or different times.

kayvonf

@bojianh: And how is this detail (which is correct) relevant to the question at hand?

bojianh

The threads of whichever blocks got scheduled first on a core would stall at the barrier, unable to run to completion, while the GPU scheduler is unable to yank out those threads and schedule others. (No preemption.)

Deadlock:

  1. Mutual exclusion <- Each execution unit in the core is not shareable by multiple threads at the same time

  2. Hold-and-wait <- The threads hold onto the execution unit while waiting at the barrier

  3. No preemption <- Every thread in a block must run to completion before being swapped out

  4. Circular wait <- The threads depend on the __syncthreads arrival of all other threads in the kernel launch.

rajul

Like @bojianh says, __syncthreads cannot act as a barrier across thread blocks because not all blocks are scheduled at once: a thread in the current block would be waiting on a thread belonging to another block, but that thread may not have been scheduled yet. Also, the GPU scheduler does not preempt thread blocks.

The scheduler works under the assumption that there are no dependencies between threads in two different blocks.
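The deadlock scenario above can be sketched as a naive (and incorrect) "global barrier" built from an atomic counter; the kernel and variable names here are hypothetical:

```cuda
// Broken global barrier across all blocks of a launch. Do NOT use.
__device__ int arrived = 0;   // count of blocks that reached the "barrier"

__global__ void broken_global_barrier_kernel() {
    // ... per-block work ...

    if (threadIdx.x == 0)
        atomicAdd(&arrived, 1);           // this block has arrived
    __syncthreads();

    // Spin until every block has arrived. If gridDim.x exceeds the number
    // of blocks the GPU can keep resident at once, the resident blocks spin
    // here forever: the remaining blocks can never be scheduled (no
    // preemption), so `arrived` never reaches gridDim.x -- deadlock.
    while (atomicAdd(&arrived, 0) < gridDim.x) { }

    // ... code that wrongly assumed all blocks had finished ...
}
```

This is exactly the circular-wait condition from the list above: resident blocks hold their cores while waiting on blocks that cannot start until those cores free up.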

jellybean

I believe that by having many more blocks than cores, the work will be load-balanced, since a waiting block is assigned to the next core that frees up once an earlier block completes.