__shared__ means the variable is shared among all threads in a block.
This comment was marked helpful 0 times.
jpaulson
Note that the spec of this program is output[i] = (input[i] + input[i+1] + input[i+2]) / 3.0f
and NOT output[i] = (input[i-1] + input[i] + input[i+1]) / 3.0f
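For concreteness, that spec written as a sequential C reference looks like this (a sketch of ours, not the course code; it assumes input holds N+2 valid elements):

```c
/* Reference for the spec above: output[i] averages input[i],
 * input[i+1], and input[i+2]. input must hold N+2 elements. */
void convolve_ref(const float *input, float *output, int N) {
    for (int i = 0; i < N; i++)
        output[i] = (input[i] + input[i + 1] + input[i + 2]) / 3.0f;
}
```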
This comment was marked helpful 0 times.
vsiao
Each of the 128 threads copies the input element at its index into the shared support array. The first two threads do double duty so that all N+2 pieces of input data get copied.
__syncthreads ensures that all of the input data is copied into the shared block memory before each thread computes its output.
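For reference, the kernel being described looks roughly like this (reconstructed from the discussion, so treat the exact names as approximate):

```cuda
#define THREADS_PER_BLK 128

__global__ void convolve(int N, float *input, float *output) {
    __shared__ float support[THREADS_PER_BLK + 2];
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperative load: every thread copies one element, and the
    // first two threads also copy the two trailing halo elements.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2)
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    __syncthreads();  // all 130 elements are staged before any thread reads

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.0f;
}
```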
This comment was marked helpful 2 times.
tnebel
The structure of this program minimizes expensive loads from main memory by copying the data to shared memory (in parallel). Then, when each thread needs to access this data, it reads it from the cheaper shared memory.
This comment was marked helpful 1 times.
jcmacdon
Using the support array shared across the block, we avoid having each thread load three array values from input to calculate its convolution output; instead, each value in input is loaded only once.
Thus, for each block we only have to make $n+2$ ($n =$ THREADS_PER_BLK) loads from input instead of $3(n+2) - 6 = 3n$ loads if we had not used support (each of the $n+2$ elements in input would be loaded 3 times, except input[0] and input[n+1] are loaded only once and input[1] and input[n] are loaded only twice, making 6 fewer loads). For $n = 128$, that is 130 loads instead of 384.
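That counting argument can be sanity-checked with a couple of throwaway helpers (names are ours):

```c
/* Global loads per block with cooperative shared-memory staging:
 * each of the n+2 needed elements is loaded exactly once. */
int loads_with_shared(int n) { return n + 2; }

/* Without shared memory: every element is loaded 3 times except the
 * four edge elements, which together save 6 loads: 3(n+2) - 6 = 3n. */
int loads_without_shared(int n) { return 3 * (n + 2) - 6; }
```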
This comment was marked helpful 2 times.
max
The total number of blocks in this piece of code is N / THREADS_PER_BLK.
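In other words, the host-side launch presumably looks something like this (assuming N is a multiple of THREADS_PER_BLK):

```cuda
// One CUDA thread per output element: N / THREADS_PER_BLK blocks
// of THREADS_PER_BLK threads each.
convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);
```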
This comment was marked helpful 0 times.
Mayank
What is the actual benefit achieved in terms of time? As @jcmacdon pointed out, we would be doing 3 times more loads from global memory if we did not use shared memory. However, since 2 threads perform 2 loads instead of 1, they also block all other threads (due to the __syncthreads() barrier). So is the savings in terms of time $3/2$ instead of $3$?
This comment was marked helpful 0 times.
kayvonf
The discussion seems to revolve around THREE possible implementations of the convolve program above:
1. An implementation where shared memory is NOT used at all and each thread issues three loads for the data it needs.
2. An implementation where thread 0 performs all loads into shared memory for the entire block.
3. The implementation provided above, which uses all threads in the thread block to cooperatively load the union of all their required data into shared memory. Here, that's 130 elements of the input array.
Let's consider each possible program and how it might map to a GPU. I will describe the implementations relative to option 3, since we described it in detail in class, and since it is the implementation shown above.
Option 1: This option is arguably the simplest option, since each CUDA thread operates independently and makes no use of the synchronization or communication capabilities of threads in a thread block. However, all together, the threads issue three times as many load instructions as in the code above. In the earliest CUDA-capable GPUs, all accesses to global memory were uncached, so three times the number of load instructions also entailed three times more data transferred from memory (ouch!). Modern GPUs feature a modestly sized L1 and L2 cache hierarchy, so the bandwidth cost of issuing three times as many loads as necessary is certainly mitigated since the program will benefit from hits in the caches. It is possible that on one of the GTX 6XXX GPUs, option 1 is competitive with, or perhaps faster than, option 3 on this simple program. Someone should code it up and let us know.
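A sketch of option 1, which skips shared memory entirely (our reconstruction, not code from the lecture):

```cuda
__global__ void convolve_option1(int N, float *input, float *output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread issues its own three global loads; any reuse
    // between neighboring threads is left to the cache hierarchy.
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];
    output[index] = result / 3.0f;
}
```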
Option 2: This option issues only the required number of loads, but does so sequentially in a single thread. It will take a significant amount of time to issue all these instructions. Also, presenting 130 loads to the chip one by one will almost certainly not make good use of the wide memory bus on a GPU that is designed to satisfy many loads from multiple threads at once.
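And a sketch of option 2, where thread 0 performs all of the block's loads serially (again our reconstruction):

```cuda
#define THREADS_PER_BLK 128

__global__ void convolve_option2(int N, float *input, float *output) {
    __shared__ float support[THREADS_PER_BLK + 2];
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread 0 serially issues all 130 loads while the other
    // 127 threads wait at the barrier.
    if (threadIdx.x == 0)
        for (int i = 0; i < THREADS_PER_BLK + 2; i++)
            support[i] = input[blockIdx.x * blockDim.x + i];

    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.0f;
}
```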
Option 3: This code does have the inefficiency that two threads in a block have to do more work than the others, since 130 does not divide evenly by 128. However, the extra work is not as bad as you might think. First, executing 4 instructions (load from global, store to shared, load from global, store to shared) in two CUDA threads and 2 instructions (load from global, store to shared) in the rest is certainly better workload balance than performing 260 instructions sequentially in one thread (130 loads from global and 130 stores to shared) while the others sit idle. Further, keep in mind that only threads assigned to the same GPU hardware warp (not all 128 threads in a block) execute in SIMD lockstep, so in this example there's divergence in only one of the four warps that make up a block! If you're an astute reader and are worried about the core going idle (because one warp is blocked on memory for those last two loads and the other three are waiting at the barrier), I'll point out that slide 52 says that a single core of a GTX 680 GPU can interleave up to 16 blocks (or up to 2,048 CUDA threads, whichever is the limiting number for a program). So the GPU scheduler almost certainly will take my simple program above and assign 16 blocks to each GPU core at once! Warps from these blocks execute interleaved on the core, as discussed in Lecture 2. That means that if all warps from one block cannot make progress, there are 15 other blocks (together totaling 60 other warps) for the core to choose from. That's a lot of hardware multi-threading! And a lot of latency-hiding capability!
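The warp arithmetic in that last point is easy to check (helper names are ours; 32 threads per warp, and the per-core limits quoted above):

```c
/* Warps per thread block, at 32 threads per warp. */
int warps_per_block(int threads_per_block) { return threads_per_block / 32; }

/* Blocks resident on one core: limited by either the block cap or
 * the thread cap, whichever binds first. */
int resident_blocks(int max_blocks, int max_threads, int threads_per_block) {
    int by_threads = max_threads / threads_per_block;
    return by_threads < max_blocks ? by_threads : max_blocks;
}
```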
I encourage you to just try coding up the possible solutions and observe what happens on a real GPU!
This comment was marked helpful 2 times.
sfackler
So I wrote up and benchmarked those three options. On the GTX 480, we see this (numbers are seconds to run the kernel 1000 times):
There doesn't seem to be any way to turn off the L2 cache, but we can disable the L1 cache by passing -Xptxas -dlcm=cg to nvcc. With that, there's an enormous difference on option 2:
I would guess that the L1 cache doesn't really matter too much for options 1 and 2 because a warp ends up consuming an entire cache line at once, avoiding the need to ever go back to cache.
On the GTX 680, we see roughly the same performance characteristics:
and without the L1 cache:
So it looks like option 2 is a terrible idea, and the cooperative copies of option 3 give you a ~7-8% performance boost on this kind of operation.
Code available here
This comment was marked helpful 5 times.
Amazing job @sfackler. I am surprised that disabling L1 does not impact option 1 more because there is 3x reuse in the program. What happens if you add a bit more reuse with a 5-wide convolution window (-2,-1, 0, 1, 2)?
This comment was marked helpful 0 times.
A 5-wide window bumps the runtimes up to:
with L1 and
without.
A 7-wide window takes:
with L1 and
without.
I updated the code with a WINDOW #define.
This comment was marked helpful 0 times.
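sfackler's updated code isn't shown above, but a WINDOW-parameterized version of the kernel plausibly looks like this (our sketch):

```cuda
#define THREADS_PER_BLK 128
#define WINDOW 5  // 3, 5, 7, ...

__global__ void convolve_windowed(int N, float *input, float *output) {
    __shared__ float support[THREADS_PER_BLK + WINDOW - 1];
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperative load, generalized: the first WINDOW-1 threads
    // also copy the halo past the end of the block's range.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < WINDOW - 1)
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < WINDOW; i++)
        result += support[threadIdx.x + i];
    output[index] = result / (float)WINDOW;
}
```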
Interesting. I cannot explain the lack of sensitivity to enabling/disabling the L1 cache. Perhaps someone from NVIDIA can help.
This comment was marked helpful 0 times.
Shouldn't the call to cudaMalloc(&devInput, N) be cudaMalloc(&devInput, sizeof(float) * N)?
This comment was marked helpful 0 times.
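Presumably yes: cudaMalloc takes a size in bytes, so the element count must be scaled by the element size. Something like the following (note the kernel also reads a two-element halo, so the input allocation arguably needs N+2 floats):

```cuda
float *devInput, *devOutput;
cudaMalloc(&devInput, sizeof(float) * (N + 2));  // N outputs read N+2 inputs
cudaMalloc(&devOutput, sizeof(float) * N);
```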