Isn't this very similar to what we did on lecture 6, slide 29 .
Yep, it's just dynamic work assignment to a fixed size pool of workers (in this case the workers are CUDA thread blocks). The work assignment is carried out by atomically updating a shared counter.
Question: I'd like to see someone summarize the difference between the scheduling approach adopted by this code here, and the more conventional CUDA convolution implementation given earlier.
Here we are using a single task queue so that each block can take the next available task by incrementing the index by blocksize (total number of threads in a single block). This is a dynamic work assignment, performed by the application (not the CUDA system) to a fixed size pool of blocks. If the number of blocks needed to perform convolution exceeds the fixed size pool of blocks, it would take multiple iterations to get the result.
The earlier implementation performs the convolution in a similar fashion but the size pool of blocks is computed by the size of the array - # of blocks = (length of the array + threads per block - 1 )/(threads per block). Each block is given a portion of the array and performs the computation in parallel and in a single iteration.
So if the number of blocks that are needed to do the convolution is greater than the fixed size pool of blocks we've specified, the earlier implementation would be faster.
What is the maximum number of blocks for GeForce GTX 480?
Let's say GeForce GTX 480 can handle upto 'M' many blocks. If the total number of blocks needed to do the convolution exceeds 'M', how would the GPU assign that block? Would it be a dynamic work assignment or a static work assignment?
@aznshodan: The answer to (1) is in the code as BLOCKS_PER_CHIP. The program creates 15x12=180 blocks which we talked about as the maximum number of blocks that can fit on the 15-480 at once given this kernel definition.
Your number (2) is a great question. What happens if all I do is change the BLOCKS_PER_CHIP to be 15x12+1 instead?
@kayvonf I was reviewing this slides and noticed your question.
I believe that nothing bad would happen here if we set BLOCKS_PER_CHIP to 15*12+1. It would be inefficient though, since the 15*12+1 block will have no work to do once its scheduled. Thread 0 would increment the counter and all of the threads will break out of the while immediately because startingIndex >= N.
This occurs because the 15*12 blocks will be scheduled initially, and will continue running until completion (that is, until startingIndex >= N). The 15*12+1 block cannot be scheduled initially because we launched as many thread blocks to fill the GPU.
@jazzbass. You are correct!
This style of "Persistent threads" is particularly useful in allowing a program to take advantage of locality. For instance, let's say we have a small lookup table which fits into shared memory that contains a mapping from some input set of values to another output set of values. Lets say we have some large array of data A that we want to convert from the input set to the output set. If we took a traditional data parallel approach we would allocate a thread for each element in A, load the lookup table into shared memory for each block of threads, apply the mapping to the elements of A, and then save the mapped elements back out to global memory.
The issue here is that each time we launch and retire a new block of elements we are consuming memory bandwidth to reload our lookup table from global memory. Instead, we could use a "persistent" block that loads the lookup table once and then processes many elements of A via looping and then stops when all elements in A have been exhausted. This reduces our memory bandwidth and contention of the global memory belonging to the lookup table.