Slide 43 of 62
mchoquet

It seems like we could replace the loading phase with:

int start_idx = blockIdx.x * blockDim.x;
support[threadIdx.x] = input[start_idx + threadIdx.x];
if (threadIdx.x >= THREADS_PER_BLK - 2) {
    support[THREADS_PER_BLK] = input[start_idx + THREADS_PER_BLK];
    support[THREADS_PER_BLK + 1] = input[start_idx + THREADS_PER_BLK + 1];
}

This adds two extra loads from memory, but I think it makes the __syncthreads() call unneeded, since the two threads that depend on the last two values load them themselves before continuing. Is this true, and if so would making this change be worth it?

spilledmilk

The __syncthreads() call is necessary to ensure that the entire support array has been loaded before continuing. Without the synchronization, it is possible for a thread to reach the addition portion of the code without the other threads that it relies on (namely the threads with IDs threadIdx.x + 1 and threadIdx.x + 2) having finished the loading portion yet, and therefore, the addition would add undefined values. The __syncthreads() call removes the possibility of this race condition.
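
For reference, here is a sketch of the kernel being discussed, reconstructed from the details in this thread rather than copied from the slide (the float element type, the kernel signature, and the 3-element average are assumptions; support, THREADS_PER_BLK, and the loading phase follow the comments above):

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];   // block-shared staging array
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Loading phase: every thread loads one element; the first two threads
    // also load the two elements just past the block's region.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }

    // Barrier: thread i is about to read support[i+1] and support[i+2],
    // which are written by other threads during the loading phase.
    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.0f;
}

Note that even with mchoquet's modified loading phase, thread i still reads support[threadIdx.x + 1] and support[threadIdx.x + 2], which (except for the very last thread) are written by neighboring threads, so the barrier is still required.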

jmnash

This code is similar to the way ISPC is coded because the function convolve does not itself show any parallelism (other than the call to __syncthreads, which I will come to later). Instead, it is the sequential code for just one thread in one block. The parallelism is invoked in the call to convolve, with the CUDA parameters <<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>. This is similar to using launch in ISPC.

As for the call to __syncthreads, I don't recall ever seeing any barriers in ISPC programs. Are they also needed and we just didn't see any/I just don't remember it? Or is it that ISPC doesn't need them since it takes care of that stuff for you, and CUDA has a lower abstraction distance than ISPC?

kkz

@jmnash I believe we didn't see barriers in ISPC programs because it was given that the gangs were independent.

eatnow

Assuming my understanding is correct, each instance in an ISPC gang is by definition synchronized, due to the semantics mentioned here

"ISPC guarantees that program instances synchronize after each program sequence point, which is roughly after each statement."

Hence barriers are unnecessary between instances of a gang. As for tasks, my guess is that a barrier across ISPC tasks is altogether impossible, since tasks may be run one by one sequentially rather than interleaved.

yetianx

Why does it load the data into a block-shared array first? Why not calculate the result directly from the input array?

yixinluo

@yetianx Good question! If you look ahead to slide 48, on-chip shared memory is much faster than global memory. In the 1D convolution example above, each input element is read three times per block (once by each of three neighboring threads). If you stage the data in block shared memory, you exploit this reuse and avoid fetching each element from global memory three times.
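
For comparison, a version without the staging step (hypothetical; the kernel name and float element type are assumptions) would be correct but would fetch each input element from global memory up to three times, once per neighboring thread:

__global__ void convolve_no_shared(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread reads its three inputs directly from global memory.
    // input[index + 1] and input[index + 2] are also read by the two
    // neighboring threads, so each element is fetched up to three times.
    // No __syncthreads() is needed here, since threads share no data.
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];
    output[index] = result / 3.0f;
}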

tcz

The code only uses the x dimension of the thread and block indices. It seems to me that every row of blocks being utilized in this example would calculate the same thing, as would every row of threads in a block. Why don't we use the other (y) dimension here? Certainly using only x helps illustrate the example better, but is there a more profound reason?

kayvonf

@tcz: This example is a 1D convolution. So there are not multiple rows.

The kernel launch creates N/THREADS_PER_BLK blocks, with THREADS_PER_BLK threads per block. Since I initialize the launch with 1D values, blockDim.y is 1 and threadIdx.y is 0 for every thread, so there is only a single "row".
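
Concretely, integer launch arguments are converted to dim3 values whose unspecified dimensions default to 1, so the launch is equivalent to the sketch below (the kernel argument list is an assumption):

dim3 grid(N / THREADS_PER_BLK);    // gridDim  = (N/THREADS_PER_BLK, 1, 1)
dim3 block(THREADS_PER_BLK);       // blockDim = (THREADS_PER_BLK, 1, 1)
convolve<<<grid, block>>>(N, devInput, devOutput);
// Inside the kernel, threadIdx.y and blockIdx.y are always 0 for this launch.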

LilWaynesFather

Why can we initialize devInput in the host code? I thought we couldn't touch the handles/pointers that we pass to the kernel functions. Otherwise, why do we need to move data between the host and device address spaces like in slide 39?

kayvonf

@LilWaynes: Your understanding is correct. The "properly initialize" comment referred to issuing the appropriate cudaMemcpy call to initialize the device array devInput.
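
In other words, the host-side setup would look roughly like this sketch (devOutput, the float element type, and the N+2 input length are assumptions based on the kernel's two extra loads):

float* input = new float[N + 2];   // host copy of the input data
float* devInput;
float* devOutput;
// ... fill input[] in host memory ...
cudaMalloc((void**)&devInput, sizeof(float) * (N + 2));
cudaMalloc((void**)&devOutput, sizeof(float) * N);
cudaMemcpy(devInput, input, sizeof(float) * (N + 2), cudaMemcpyHostToDevice);

convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);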