Slide 41 of 62
paraU

Here, the lecture notes say that CUDA will implicitly synchronize all the threads, but the homework description says: "Note that CUDA kernel's execution on the GPU is asynchronous with the main application thread running on the CPU. Therefore, you will want to place a call to cudaThreadSynchronize following the kernel call to wait for completion of all CUDA work." This means we need to do the synchronization ourselves.

I'm confused about this.

kayvonf

Both statements are correct.

When a CUDA kernel is launched, the kernel is not complete until all of its threads have terminated. You can think of each kernel launch as a command sent to the GPU. By default, the GPU processes these commands in order: if your program launches K1 and then K2, all K1 threads will complete before any thread from K2 begins execution. Thus you can think of there being a barrier between the threads described by K1 and the threads described by K2.

CUDA, however, says nothing about the order of kernel execution relative to execution of the main host thread. When the CUDA launch call returns, it does not mean the kernel is complete. If the main CPU thread wants to know that the kernel is complete, it can call cudaThreadSynchronize, which only returns once all prior kernel launches have completed.

rokhinip

So to clarify: a CUDA kernel is the function that is called to run on the GPU. The launch is a non-blocking call from the CPU code, so to check the result of the call we need to call cudaThreadSynchronize from our CPU code to wait for all of the thread blocks to finish their work. This is in contrast to the __syncthreads function, which is called within the GPU code in order to synchronize the threads within a single block.
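To illustrate the in-kernel side of this contrast, here is a minimal sketch (kernel and buffer names are hypothetical) of the classic use of __syncthreads: a tree reduction in shared memory, where every thread of the block must reach the barrier before any proceeds to the next step.

```cuda
// __syncthreads() is a barrier for the threads of ONE block only.
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                   // all loads done before any reads

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            buf[t] += buf[t + stride];
        __syncthreads();               // barrier between reduction steps
    }
    if (t == 0)
        out[blockIdx.x] = buf[0];      // one partial sum per block
}
```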

cardiff

On the subject of synchronization constructs, some other atomic functions that help with synchronized access to memory are included as builtins in gcc (see the gcc atomic builtins reference). These are especially useful in Assignment 3.