Slide 43 of 62
mchoquet

It seems like we could replace the loading phase with:

int start_idx = blockIdx.x * blockDim.x;
support[threadIdx.x] = input[start_idx + threadIdx.x];
if (threadIdx.x >= THREADS_PER_BLK - 2) {
    support[THREADS_PER_BLK] = input[start_idx + THREADS_PER_BLK];
    support[THREADS_PER_BLK + 1] = input[start_idx + THREADS_PER_BLK + 1];
}

This adds two extra loads from memory, but I think it makes the __syncthreads() call unneeded, since the two threads that depend on the last two values load them themselves before continuing. Is this true, and if so would making this change be worth it?

spilledmilk

The __syncthreads() call is necessary to ensure that the entire support array has been loaded before continuing. Without the synchronization, it is possible for a thread to reach the addition portion of the code without the other threads that it relies on (namely the threads with IDs threadIdx.x + 1 and threadIdx.x + 2) having finished the loading portion yet, and therefore, the addition would add undefined values. The __syncthreads() call removes the possibility of this race condition.
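
For reference, here is a sketch of the kernel being discussed, reconstructed from the details in this thread rather than copied from the slide (the float element type, the kernel signature, and the 3-element average are assumptions; support, THREADS_PER_BLK, and the loading phase follow the comments above):

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];   // block-shared staging array
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Loading phase: every thread loads one element; the first two threads
    // also load the two elements just past the block's region.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }

    // Barrier: thread i is about to read support[i+1] and support[i+2],
    // which are written by other threads during the loading phase.
    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.0f;
}

Note that even with mchoquet's modified loading phase, thread i still reads support[threadIdx.x + 1] and support[threadIdx.x + 2], which (except for the very last thread) are written by neighboring threads, so the barrier is still required.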

jmnash

This code is similar to the way ISPC is coded because the function convolve does not itself show any parallelism (other than the call to __syncthreads, which I will come to later). Instead, it is the sequential code for just one thread in one block. The parallelism is invoked in the call to convolve, with the CUDA parameters <<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>. This is similar to using launch in ISPC.

As for the call to __syncthreads, I don't recall ever seeing any barriers in ISPC programs. Are they also needed and we just didn't see any/I just don't remember it? Or is it that ISPC doesn't need them since it takes care of that stuff for you, and CUDA has a lower abstraction distance than ISPC?

kkz

@jmnash I believe we didn't see barriers in ISPC programs because it was given that the gangs were independent.

eatnow

Assuming my understanding is correct, each instance in an ISPC gang is by definition synchronized, due to the semantics mentioned here

"ISPC guarantees that program instances synchronize after each program sequence point, which is roughly after each statement."

Hence barriers are unnecessary between instances of a gang. As for tasks, my guess is that a barrier across ISPC tasks is altogether impossible, since tasks may be run one by one sequentially rather than interleaved.

yetianx

Why does it load the data into a block-shared array first? Why not calculate the result directly from the input array?

yixinluo

@yetianx Good question! If you look ahead to slide 48, on-chip shared memory is much faster than global memory. In the 1D convolution example above, each input element is read three times per block (once by each of three neighboring threads). If you stage the data in block shared memory, you exploit this reuse and avoid fetching each element from global memory three times.
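
For comparison, a version without the staging step (hypothetical; the kernel name and float element type are assumptions) would be correct but would fetch each input element from global memory up to three times, once per neighboring thread:

__global__ void convolve_no_shared(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread reads its three inputs directly from global memory.
    // input[index + 1] and input[index + 2] are also read by the two
    // neighboring threads, so each element is fetched up to three times.
    // No __syncthreads() is needed here, since threads share no data.
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];
    output[index] = result / 3.0f;
}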

tcz

The code only uses the x dimension of the thread and block indices. It seems to me that every row of blocks being utilized in this example would calculate the same thing, as would every row of threads in a block. Why don't we use the other (y) dimension here? Certainly using only x helps illustrate the example better, but is there a more profound reason?

kayvonf

@tcz: This example is a 1D convolution. So there are not multiple rows.

The kernel launch creates N/THREADS_PER_BLK blocks, with THREADS_PER_BLK threads per block. Since I initialize the launch with 1D values, blockDim.y is 1 and threadIdx.y is 0 for every thread, so there is only a single "row".
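
Concretely, integer launch arguments are converted to dim3 values whose unspecified dimensions default to 1, so the launch is equivalent to the sketch below (the kernel argument list is an assumption):

dim3 grid(N / THREADS_PER_BLK);    // gridDim  = (N/THREADS_PER_BLK, 1, 1)
dim3 block(THREADS_PER_BLK);       // blockDim = (THREADS_PER_BLK, 1, 1)
convolve<<<grid, block>>>(N, devInput, devOutput);
// Inside the kernel, threadIdx.y and blockIdx.y are always 0 for this launch.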

LilWaynesFather

Why can we initialize devInput in the host code? I thought we couldn't touch the handles/pointers that we pass to the kernel functions. Otherwise, why do we need to move data between the host and device address spaces like in slide 39?

kayvonf

@LilWaynes: Your understanding is correct. The "properly initialize" comment referred to issuing the appropriate cudaMemcpy call to initialize the device array devInput.
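
In other words, the host-side setup would look roughly like this sketch (devOutput, the float element type, and the N+2 input length are assumptions based on the kernel's two extra loads):

float* input = new float[N + 2];   // host copy of the input data
float* devInput;
float* devOutput;
// ... fill input[] in host memory ...
cudaMalloc((void**)&devInput, sizeof(float) * (N + 2));
cudaMalloc((void**)&devOutput, sizeof(float) * N);
cudaMemcpy(devInput, input, sizeof(float) * (N + 2), cudaMemcpyHostToDevice);

convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);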