It seems like we could replace the loading phase so that the two threads that depend on the last two values load those values themselves before continuing. This adds two extra loads from memory, but I think it makes the __syncthreads() call unneeded. Is this true, and if so, would making this change be worth it?
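Something like the following, presumably (a hypothetical sketch of the proposed loading phase; support is the THREADS_PER_BLK + 2 staging array from the kernel on the slide):

    // Sketch of the proposed loading phase: instead of threads 0 and 1
    // loading the two halo elements, the last two threads of the block
    // (the only ones that read them) each load both halo values for
    // themselves -- two extra global loads in total.
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    support[threadIdx.x] = input[index];
    if (threadIdx.x >= THREADS_PER_BLK - 2) {
        support[THREADS_PER_BLK]     = input[blockIdx.x * blockDim.x + THREADS_PER_BLK];
        support[THREADS_PER_BLK + 1] = input[blockIdx.x * blockDim.x + THREADS_PER_BLK + 1];
    }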
This comment was marked helpful 0 times.
spilledmilk
The __syncthreads() call is necessary to ensure that the entire support array has been loaded before continuing. Without the synchronization, it is possible for a thread to reach the addition portion of the code without the other threads that it relies on (namely the threads with IDs threadIdx.x + 1 and threadIdx.x + 2) having finished the loading portion yet, and therefore, the addition would add undefined values. The __syncthreads() call removes the possibility of this race condition.
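For reference, a minimal sketch of the kernel being discussed, reconstructed from the slide (the exact names and the final averaging divide are assumptions):

    #define THREADS_PER_BLK 128

    __global__ void convolve(int N, float* input, float* output) {
        __shared__ float support[THREADS_PER_BLK + 2];   // block-shared staging array
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Cooperative load: every thread loads one element; threads 0 and 1
        // also load the two halo elements just past the block's range.
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2)
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

        // Barrier: support[threadIdx.x + 1] and support[threadIdx.x + 2]
        // are stored by neighboring threads, so every thread must wait here.
        __syncthreads();

        float result = 0.0f;
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result / 3.f;
    }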
This comment was marked helpful 2 times.
jmnash
This code is similar to the way ISPC is coded because the function convolve does not itself express any parallelism (other than the call to __syncthreads, which I will come to later). Instead, it is the sequential code for just one thread in one block. The parallelism is invoked in the call to convolve, with the CUDA launch parameters <<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>, as sketched below. This is similar to using launch in ISPC.
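A sketch of that host-side call (the argument list is an assumption):

    // Launch N/THREADS_PER_BLK blocks of THREADS_PER_BLK threads; each
    // thread runs the sequential body of convolve, much as ISPC's launch
    // spawns many instances of sequential task code.
    convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);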
As for the call to __syncthreads, I don't recall ever seeing any barriers in ISPC programs. Are they also needed and we just didn't see any/I just don't remember it? Or is it that ISPC doesn't need them since it takes care of that stuff for you, and CUDA has a lower abstraction distance than ISPC?
This comment was marked helpful 0 times.
kkz
@jmnash I believe we didn't see barriers in ISPC programs because it was given that the gangs were independent.
This comment was marked helpful 0 times.
eatnow
Assuming my understanding is correct, each instance in an ISPC gang is by definition synchronized, due to the semantics mentioned here
"ISPC guarantees that program instances synchronize after each program sequence point, which is roughly after each statement."
Hence barriers are unnecessary between instances of a gang. As for tasks, my guess is that barriers across ISPC tasks are altogether impossible, since tasks may be run one by one sequentially rather than interleaved.
This comment was marked helpful 0 times.
yetianx
Why does it load the data into a block-shared array first? Why not compute the result directly from the input array?
This comment was marked helpful 0 times.
yixinluo
@yetianx Good question! If you look ahead to slide 48, on-chip shared memory is much faster than global memory [reference]. In the 1D convolution example above, each input element is accessed up to three times per block, so there is plenty of locality to exploit. If you stage the array in block shared memory, you pay the global-memory cost once per element instead of fetching it three times, as the sketch below illustrates.
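A sketch of the direct alternative, assuming the same kernel shape (no shared memory; still correct, but every thread issues its own global loads):

    // Hypothetical version without the staging array: each input element
    // is now fetched from global memory by up to three different threads.
    __global__ void convolve_direct(int N, float* input, float* output) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        float result = 0.0f;
        for (int i = 0; i < 3; i++)
            result += input[index + i];   // three global loads per thread
        output[index] = result / 3.f;
    }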
This comment was marked helpful 1 time.
tcz
The code only uses the x dimension of the thread and block indices. It seems to me that every row of blocks being utilized in this example would calculate the same thing, as would every row of threads in a block. Why don't we use the other (y) dimension here? Certainly using only x helps illustrate the example better, but is there a more profound reason?
This comment was marked helpful 0 times.
kayvonf
@tcz: This example is a 1D convolution. So there are not multiple rows.
The kernel launch creates N/THREADS_PER_BLOCK blocks, with THREADS_PER_BLOCK threads per block. Since I initialize the launch with 1D values, blockDim.y is 1.
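Written out with explicit dim3 values (the scalar <<<...>>> arguments are shorthand, and unspecified dimensions default to 1):

    dim3 grid(N / THREADS_PER_BLK);   // grid.y == grid.z == 1 by default
    dim3 block(THREADS_PER_BLK);      // block.y == block.z == 1 by default
    convolve<<<grid, block>>>(N, devInput, devOutput);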
This comment was marked helpful 0 times.
LilWaynesFather
Why can we initialize devInput in the host code? I thought we couldn't touch the handles/pointers that we pass to the kernel functions. Otherwise, why would we need to move data between the host and device address spaces as in slide 39?
This comment was marked helpful 0 times.
kayvonf
@LilWaynes: Your understanding is correct. The "properly initialize" comment referred to issuing the appropriate cudaMemcpy call to initialize the device array devInput.
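A minimal sketch of that setup (buffer size and names are assumptions; the point is that the host never dereferences devInput, it only copies into it):

    // The host cannot read or write device memory through devInput
    // directly; it initializes it by copying from a host buffer.
    float* hostInput = (float*)malloc(N * sizeof(float));
    // ... fill hostInput with the input signal ...

    float* devInput;
    cudaMalloc(&devInput, N * sizeof(float));
    cudaMemcpy(devInput, hostInput, N * sizeof(float),
               cudaMemcpyHostToDevice);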