Slide 50 of 81
srb

I don't think I understand why host/device synchronization is notable. Shouldn't it be a given that the kernel waits for all threads to finish before it returns? Are there examples of similar abstractions where something like a function launches work and then returns without guaranteeing that the work has completed?

elephanatic

Is there a difference between calling __syncthreads() inside the kernel(s) and calling cudaThreadSynchronize() in the host code?

POTUS

@elephanatic __syncthreads() is a device-side barrier: it waits until all threads within a block reach that point in the kernel. cudaThreadSynchronize() (since deprecated in favor of cudaDeviceSynchronize()) is called from the host and blocks until all previously launched kernels have completed.
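A minimal sketch of the two synchronization scopes (not from the lecture; the kernel, array size, and block size of 256 are all made up for illustration):

```cuda
#include <cuda_runtime.h>

// Reverses an array within a single block using shared memory.
__global__ void reverse(int *data, int n) {
    __shared__ int tmp[256];           // assumes blockDim.x == n == 256
    int i = threadIdx.x;
    tmp[i] = data[i];
    __syncthreads();                   // barrier for all threads in THIS block only:
                                       // every tmp[] write is visible before any read below
    data[i] = tmp[n - 1 - i];
}

int main() {
    int *d;
    cudaMalloc(&d, 256 * sizeof(int));
    reverse<<<1, 256>>>(d, 256);       // kernel launch returns to the host immediately
    cudaDeviceSynchronize();           // host blocks here until all launched kernels finish
    cudaFree(d);
    return 0;
}
```

So one is a barrier among threads inside a kernel, the other is a barrier between the host and the device.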

pk267

@srb - yes, a server handling concurrent clients works in a similar way. Once the connection is established, the server code returns and lets the client fill the buffer asynchronously, at its own pace, without blocking the server or making the other clients wait in the queue. If the server had to wait, the throughput of the system would be seriously affected.

ayy_lmao

@srb - also, it can be advantageous to have multiple kernels running at the same time, in which case the implicit barrier is important: if a kernel could "return" before all the threads in its blocks had finished, the hardware might start scheduling blocks from a different kernel alongside the leftover threads, causing issues.

sampathchanda

Note that CUDA provides atomic operations on global memory as well. It is often assumed that, because global memory accesses are slow, atomic operations on that memory would not be supported, so this is interesting to know.
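A classic use of global-memory atomics is a histogram, where many threads across many blocks increment the same shared bins. A minimal sketch (the kernel name and bin layout are hypothetical):

```cuda
#include <cuda_runtime.h>

// Each thread classifies one input element and atomically increments
// the corresponding bin. bins[] lives in GLOBAL memory, so threads from
// different blocks can safely update the same counter.
__global__ void histogram(const int *input, int n, int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[input[i]], 1);   // atomic read-modify-write on global memory
}
```

Without atomicAdd, two threads reading, incrementing, and writing back the same bin could lose one of the updates; the atomic makes the read-modify-write indivisible, at the cost of serializing contended updates.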