Slide View : 15-418/618 Spring 2014

Previous | Next --- Slide 34 of 62

mamini

I was looking at the code above, and thought that there has to be a wrapper function for the part that's below ///// line. In other words, starting from dim3 threadsPerBlock (..) .. till the Kernel call has to be within a wrapper function. But it's not shown here in the slides. Is that right?

I am not sure what dim3 means. In the assign. base code, number of blocks and number of threads are given as constants. How does that differ from this?

This comment was marked helpful 0 times.

ron

If we're thinking in C/C++, the code below ///// would probably be included in your main() function.

According to this useful link, dim3 is a vector of 3 elements (x, y, and z), so threadsPerBlock(4,3,1) initializes the threadsPerBlock vector with x=4,y=3,z=1. Notice that in the example above, thread indexes (threadIdx) and block indexes (blockIdx) also have 3 components: an x, y, and z, rather than the usual single-valued index. By specifying that the number of threads in a block is 4x3x1, we tell CUDA not only that there are 4x3x1=12 threads in a block, but also that the threads in each block should have indices that span 4 x-values, 3 y-values, and 1 z-value. For example the 12 threads in the first block will have the following x-y-z indices:

(0,0,0), (1,0,0), (2,0,0), (3,0,0)
(0,1,0), (1,1,0), (2,1,0), (3,1,0)
(0,2,0), (1,2,0), (2,2,0), (3,2,0)

Similarly, the block indices are in a Nx/threadsPerBlock.x by Ny/threadsPerBlock.y by 1 array.

This is helpful for indexing into data when the data being processed in parallel has more than one dimension; for example a 12x3 element 2-D array, where you want each block of threads to process a 4x3 section of that array. The previous slide illustrates this quite well.

This comment was marked helpful 1 times.

adsmith

I thought CUDA execution was asynchronous with the CPU threads. If the call to matrixAdd doesn't return until all threads have terminated, why do we need cudaThreadSynchronize?

This comment was marked helpful 0 times.

kayvonf

@adsmith. Your understanding is correct. The call to matrixAdd may return prior to the GPU completing the work associated with the launch call. A more lengthly discussion is here.

This comment was marked helpful 0 times.