jpaulson

__shared__ means shared among all threads in a block.

jpaulson

Note that the spec of this program is output[i] = (input[i] + input[i+1] + input[i+2]) / 3.0f and NOT output[i] = (input[i-1] + input[i] + input[i+1]) / 3.0f
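
For readers who want the spec pinned down, here is a minimal sequential reference for it (the function name is just illustrative; note that output has N elements, so input must hold N + 2):

    // Sequential reference for the spec above: output[i] averages
    // input[i..i+2], so input needs N + 2 elements for N outputs.
    void convolve_reference(int N, const float* input, float* output) {
        for (int i = 0; i < N; i++)
            output[i] = (input[i] + input[i + 1] + input[i + 2]) / 3.0f;
    }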

vsiao

Each of the 128 threads loads the input element at its index and copies it into the shared support array. The first two threads do double duty so that all 130 (THREADS_PER_BLK + 2) pieces of input data the block needs are copied.

__syncthreads ensures that all of the input data is copied into the shared block memory before each thread computes its output.
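
For reference, a sketch of the kernel being described here, reconstructed from this description rather than copied from the slide (so treat names and details as approximations):

    #define THREADS_PER_BLK 128

    __global__ void convolve(int N, float* input, float* output) {
        // 128 outputs per block require 130 contiguous inputs.
        __shared__ float support[THREADS_PER_BLK + 2];
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Cooperative load: every thread copies one element, and the
        // first two threads also copy the two extra trailing elements.
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2)
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

        // All loads must complete before any thread reads support.
        __syncthreads();

        float result = 0.0f;
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result / 3.0f;
    }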

tnebel

The structure of this program minimizes the expensive loads from main memory by copying the data into shared memory (in parallel). Then, when each thread needs to access this data, it reads it from the cheaper shared memory.

jcmacdon

Using the support array shared across the block, each thread no longer has to load three values from input to compute its convolution output; instead, each value in input is loaded only once.

Thus, for each block we only have to make $n+2$ ($n =$ THREADS_PER_BLK) loads from input instead of $3(n+2) - 6 = 3n$ loads if we had not used support (each of the $n+2$ elements in input would be loaded 3 times, except input[0] and input[n+1] would be loaded only once and input[1] and input[n] only twice, making 6 fewer loads).
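
Plugging in the numbers: with $n = 128$, that is $n + 2 = 130$ global loads per block using the shared support array versus $3n = 384$ without it, roughly a $3\times$ reduction in load instructions per block.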

max

The total number of blocks in this piece of code is N / THREADS_PER_BLK
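
For concreteness, a host-side launch sketch under that assumption (the function name is mine; it presumes N is a multiple of THREADS_PER_BLK and that devInput holds N + 2 floats):

    // Launch N / THREADS_PER_BLK blocks of THREADS_PER_BLK threads each.
    void launch_convolve(int N, float* devInput, float* devOutput) {
        convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);
        cudaDeviceSynchronize();
    }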

Mayank

What is the actual benefit achieved in terms of time? As @jcmacdon pointed out, we would be doing 3 times more loads from global memory if we did not use shared memory. However, since 2 threads perform 2 loads instead of 1, they also block all other threads (due to the __syncthreads() barrier). So are the savings in terms of time $3/2$ instead of $3$?

kayvonf

The discussion seems to revolve around THREE possible implementations of the convolve program above:

  1. An implementation where shared memory is NOT used at all and each thread issues three loads for the data it needs.
  2. An implementation where thread 0 performs all loads into shared memory for the entire block.
  3. The implementation provided above, which uses all threads in the thread block to cooperatively load the union of all their required data into shared memory. Here, that's 130 elements of the input array.

Let's consider each possible program, and how it might map to a GPU. I will describe the implementations relative to option 3, since we described option three in detail in class, and also since that's the implementation shown above.

Option 1: This option is arguably the simplest option, since each CUDA thread operates independently and makes no use of the synchronization or communication capabilities of threads in a thread block. However, all together, the threads issue three times as many load instructions as in the code above. In the earliest CUDA-capable GPUs, all accesses to global memory were uncached, so three times the number of load instructions also entailed three times more data transferred from memory (ouch!). Modern GPUs feature a modestly sized L1 and L2 cache hierarchy, so the bandwidth cost of issuing three times as many loads as necessary is certainly mitigated since the program will benefit from hits in the caches. It is possible that on one of the GTX 6XXX GPUs, option 1 is competitive with, or perhaps faster than, option 3 on this simple program. Someone should code it up and let us know.
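
A sketch of what option 1 might look like (my code, not from the lecture):

    // Option 1 (sketch): no shared memory, no __syncthreads(); each
    // thread simply issues its own three global loads.
    __global__ void convolve_no_shared(int N, float* input, float* output) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        output[index] = (input[index] + input[index + 1] + input[index + 2]) / 3.0f;
    }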

Option 2: This option issues only the required number of loads, but does so sequentially in a single thread. It will take a significant amount of time to issue all of these instructions. Also, presenting 130 loads to the chip one by one will almost certainly not make good use of the wide memory bus on a GPU, which is designed to satisfy many loads from multiple threads at once.
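
And a sketch of option 2 (again my code, assuming the same THREADS_PER_BLK and support layout as above):

    // Option 2 (sketch): thread 0 serially issues all 130 loads for the
    // block while the other 127 threads wait at the barrier.
    __global__ void convolve_thread0_loads(int N, float* input, float* output) {
        __shared__ float support[THREADS_PER_BLK + 2];
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        if (threadIdx.x == 0) {
            for (int i = 0; i < THREADS_PER_BLK + 2; i++)
                support[i] = input[blockIdx.x * blockDim.x + i];
        }
        __syncthreads();

        float result = 0.0f;
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result / 3.0f;
    }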

Option 3: This code does have the inefficiency that two threads in a block have to do more work than the others, since 130 does not divide evenly among 128 threads. However, the extra work is not as bad as you might think. First, executing 4 instructions (load from global, store to shared, load from global, store to shared) in two CUDA threads and 2 instructions (load from global, store to shared) in the rest is certainly better workload balance than performing 260 instructions sequentially in one thread (130 loads from global and 130 stores to shared) while the others sit idle. Further, keep in mind that only threads assigned to the same GPU hardware warp (not all 128 threads in a block) execute in SIMD lockstep, so in this example there's divergence in only one of the four warps that make up a block! If you're an astute reader and are worried about the core going idle (because one warp is blocked on memory for those last two loads and the other three are waiting at the barrier), I'll point out that slide 52 says that a single core of a GTX 680 GPU can interleave up to 16 blocks (or up to 2,048 CUDA threads, whichever is the limiting number for a program). So the GPU scheduler almost certainly will take my simple program above and assign 16 blocks to each GPU core at once! Warps from these blocks execute interleaved on the core as discussed in Lecture 2. That means that if all warps from one block cannot make progress, there are 15 other blocks (60 other warps in total) for the core to choose from. That's a lot of hardware multi-threading! And a lot of latency hiding capability!

I encourage you to just try coding up the possible solutions and observe what happens on a real GPU!

sfackler

So I wrote up and benchmarked those three options. On the GTX 480, we see this (numbers are seconds to run the kernel 1000 times):

Option 1: 0.092963573
Option 2: 1.251477951
Option 3: 0.086575691

There doesn't seem to be any way to turn off the L2 cache, but we can disable the L1 cache by passing -Xptxas -dlcm=cg to nvcc. With that, there's an enormous difference on option 2:

Option 1: 0.097117766
Option 2: 2.815642568
Option 3: 0.086821925

I would guess that the L1 cache doesn't really matter too much for options 1 and 3 because a warp ends up consuming an entire cache line at once, avoiding the need to ever go back to the cache.

On the GTX 680, we see roughly the same performance characteristics:

Option 1: 0.073917068
Option 2: 0.927370628
Option 3: 0.072028765

and without the L1 cache:

Option 1: 0.078198984
Option 2: 2.064859373
Option 3: 0.072251701

So it looks like option 2 is a terrible idea, and the option 3 shared-memory copies give you a ~7-8% performance boost over option 1 on the GTX 480 (a smaller one on the 680) for this kind of operation.

Code available here

kayvonf

Amazing job @sfackler. I am surprised that disabling L1 does not impact option 1 more because there is 3x reuse in the program. What happens if you add a bit more reuse with a 5-wide convolution window (-2,-1, 0, 1, 2)?

sfackler

A 5-wide window bumps the runtimes up to:

Option 1: 0.143365269
Option 2: 1.274452689
Option 3: 0.088349569

with L1 and

Option 1: 0.133055357
Option 2: 2.858435744
Option 3: 0.088058527

without.

A 7-wide window takes:

Option 1: 0.168962534
Option 2: 1.296426201
Option 3: 0.091151181

with L1 and

Option 1: 0.166247257
Option 2: 2.901670890
Option 3: 0.091110048

without.

I updated the code with a WINDOW #define.
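
For anyone following along without the code, a guess at how such a WINDOW parameterization might look (this is a reconstruction, not @sfackler's actual code, and it keeps the forward-indexing spec from above):

    #define WINDOW 5   // convolution width: 3, 5, 7, ...

    __global__ void convolve_window(int N, float* input, float* output) {
        __shared__ float support[THREADS_PER_BLK + WINDOW - 1];
        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Each thread loads one element; the first WINDOW - 1 threads
        // also load the extra trailing elements for the block.
        support[threadIdx.x] = input[index];
        if (threadIdx.x < WINDOW - 1)
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
        __syncthreads();

        float result = 0.0f;
        for (int i = 0; i < WINDOW; i++)
            result += support[threadIdx.x + i];
        output[index] = result / (float)WINDOW;
    }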

kayvonf

Interesting. I cannot explain the lack of sensitivity to enabling/disabling the L1 cache. Perhaps someone from NVIDIA can help.

aakashr

Shouldn't the call to cudaMalloc(&devInput, N) be cudaMalloc(&devInput, sizeof(float) * N)?
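
For what it's worth, cudaMalloc's size argument is indeed in bytes, so the allocation would need to look something like the sketch below (and since the 3-wide kernel also reads input[index + 2], the input buffer needs room for N + 2 floats):

    float *devInput, *devOutput;
    cudaMalloc(&devInput,  sizeof(float) * (N + 2));  // size in bytes, plus 2 extra elements
    cudaMalloc(&devOutput, sizeof(float) * N);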