Slide 39 of 81
Qiaoyu

Could I think of the Grid as a matrix, with Blocks as a first level of partitioning of the Grid into (rowBlock, columnBlock, depthBlock) blocks (so there are rowBlock * columnBlock * depthBlock blocks per grid), and Threads as a second level of partitioning of each Block into (rowThread, columnThread, depthThread) threads (so there are rowThread * columnThread * depthThread threads per block)?
If so, the total number of threads would be rowBlock * columnBlock * depthBlock * rowThread * columnThread * depthThread, i.e. 2 * 3 * 1 * 3 * 4 * 1 = 72 threads in the picture above?

firebb

I think we could also think of it as Nx * Ny = 72 threads. We partition them into 2 * 3 = 6 blocks, and each block contains 3 * 4 = 12 threads. This matches the code. Note that in the code both threadsPerBlock and numBlocks are declared as 3D, but the last dimension is 1.

cwchang

From this tutorial, both numBlocks and threadsPerBlock can be either a 3-dimensional vector or just an int; for example, dim3(w, 1, 1) is equivalent to w. I think reasoning about the number of threads per block and the number of blocks as plain integers is more intuitive at first. Once you have some sense of that, thinking about a 3-dimensional numBlocks or threadsPerBlock becomes much easier.

RomanArena

The slides assume that A, B, and C are allocated as Nx * Ny float arrays, so we can think of the CUDA program as launching 72 concurrent threads to run the matrixAdd function.

username

Would changing the dimensions of threads per block to, say, (6, 2, 1) or (12, 1, 1) have any effect on the performance of the code? Should we try to keep the dimensions as close to square as possible?

kapalani

How did we decide on 6 blocks with 12 threads each, as opposed to some other configuration (like 9 blocks with 8 threads each)? Does it have to do with how much local memory each thread consumes, so as to make sure the block of threads can effectively use the block's shared memory? For example, a single block of 72 threads could mean that the shared data for all 72 threads no longer fits in the shared memory available to one block.