haibinl

This slide illustrates the advantage of block shared memory. Block shared memory is quite fast compared to global memory; a good analogy is the L1 cache in a conventional CPU. In this example, each thread in the block issues just one load from global memory (except that two of the threads issue two loads each), for a total of 130 global memory loads per block. The later reads come from block shared memory and are fast. Meanwhile, in the previous example without block shared memory, each thread issued 3 loads from global memory, roughly 3x the global memory traffic of this slide (384 loads vs. 130).
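
For reference, a minimal sketch of what the previous slide's approach (no shared memory) might look like; the kernel name, indexing, and the 3-wide averaging are assumptions rather than the slide's exact code:

```cuda
// Each thread computes one output element and issues 3 loads from global
// memory, so a 128-thread block performs 3 * 128 = 384 global loads.
// input is assumed to have N + 2 elements so index + 2 stays in bounds.
__global__ void convolve_no_shared(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];   // every iteration reads global memory
    output[index] = result / 3.f;
}
```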

huehue

We reuse elements of the input array when computing the output array. Most input elements are used in 3 separate output computations, so it makes sense to take advantage of block shared memory (in this slide) rather than to let each thread compute an output element independently (like in the previous slide). This way, we only need to load those reused elements once (we put them into the support array) and we can access them from there to do the output computations.

stl

I understand that the purpose of the change from the last slide is to use the faster memory shared by threads in a block, so the threads don't have to go to global memory to load all 3 values they need for the computation. I'm a little confused about the actual computation that's going on. Specifically, why are there 130 load instructions and not 128?

PID_1

@stl The 1D convolution operation updates a cell using its own value plus the values of its left and right neighbors. To update the values of 128 cells, we need to load 130: the 128 cells themselves plus the cells immediately to the left and right of this sequence, since those are needed in the calculation of cell 0 (leftmost) and cell 127 (rightmost).
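
A sketch of the shared-memory version being discussed, to make the 128 + 2 = 130 count concrete (THREADS_PER_BLK and the exact indexing are assumptions, not necessarily the slide's code):

```cuda
#define THREADS_PER_BLK 128

// input is assumed to have N + 2 elements so the halo loads stay in bounds.
__global__ void convolve(int N, float* input, float* output) {
    // One instance of this buffer per thread block: 128 cells plus 2 halo cells.
    __shared__ float support[THREADS_PER_BLK + 2];

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element; the first two threads also load the two
    // halo elements just past the block, giving 128 + 2 = 130 global loads.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2)
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    __syncthreads();   // wait until the whole support array is populated

    // All three reads per output element now come from fast block shared memory.
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```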

PID_1

I wonder how much of a performance impact it would have to let one thread load all 130 values instead of each thread loading 1 (or 2, in the special cases). The GTX 980 hardware shown earlier can execute four load instructions in parallel, but the order of execution will be arbitrary. Loading from 1 thread serializes this step, but means the loads will be sequential, taking advantage of any caching the system provides. On a CPU, which prioritizes caching over bandwidth, this might actually overcome the 4x parallelization. But GPUs do not rely on caches as heavily, so maybe the 4x parallelism is the better choice.
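
A hypothetical sketch of the one-thread-loads-everything variant being described (not the slide's code, and not recommended, as discussed in the reply below):

```cuda
#define THREADS_PER_BLK 128

__global__ void convolve_single_loader(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread 0 loads all 130 values by itself; the other 127 threads in the
    // block sit at the barrier while the 130 loads are issued serially.
    if (threadIdx.x == 0) {
        for (int i = 0; i < THREADS_PER_BLK + 2; i++)
            support[i] = input[blockIdx.x * blockDim.x + i];
    }
    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```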

karima

@PID_1 Good question!

The CUDA documentation tells us:

Global (device) memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.

In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a 32-byte memory transaction is generated for each thread's 4-byte access, throughput is divided by 8. source

Okay, but what does this all mean? Remember that just like ISPC, CUDA uses the SPMD (single program multiple data) model of programming. That means one instruction is executed across all CUDA threads in a warp simultaneously.

In this case, we're talking about a load instruction. Each individual CUDA thread does NOT send out its own unique load instruction. Rather, all of the data requested by the individual threads gets "coalesced" into one memory instruction for the entire warp. Furthermore, these instructions must access contiguous 32-, 64-, or 128-byte chunks of device memory at a time, AND each thread can ask for at most 4 bytes at a time within each warp-wide instruction.

Why is each thread only allowed to ask for 4 bytes at a time?

Think of it like this: if I have to hand out candy to 32 demanding kids at my doorstep on Halloween, and I can only carry 128 candy bars in my arms at a time from the candy box in my kitchen, the fairest way to distribute the candy is 4 bars at a time to each kid, because they all want some candy right now.

That is the way CUDA was designed, with the assumption that all threads in a block will probably require roughly the same amount of data for each warp-wide instruction (they are, after all, executing the same instructions in lockstep, so it makes sense that they'd all have similar (and hopefully contiguous) data demands).

So if you had one thread ask for all 128 bytes that a warp would otherwise fetch in a single coalesced load, this would be very bad: it would take 32 load instructions instead of just 1, because a single thread can only ask for 4 bytes of global memory at a time.

This is why the way the loading is done on this slide is best practice.
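
A toy sketch of the contrast described above (kernel names and launch shapes are assumptions): in the first kernel, consecutive threads of a warp touch consecutive 4-byte words, so the warp's 32 requests coalesce into a single 128-byte transaction; in the second, one thread walks the same 128 bytes by itself and issues 32 separate requests.

```cuda
// Coalesced: thread k of each warp reads the k-th consecutive float, so the
// warp's 32 x 4-byte accesses are serviced by one 128-byte transaction.
__global__ void coalesced_copy(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Serialized: a single thread requests all 32 floats itself, so the same
// 128 bytes take 32 separate load instructions (each using only 4 bytes of
// whatever transaction it triggers). Arrays are assumed to hold
// gridDim.x * 32 floats.
__global__ void serialized_copy(const float* in, float* out) {
    if (threadIdx.x == 0) {
        int base = blockIdx.x * 32;
        for (int k = 0; k < 32; k++)
            out[base + k] = in[base + k];
    }
}
```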

lol

To further this discussion, shared memory is organized into smaller pieces called memory banks. Each of these smaller pieces can be accessed simultaneously by threads, so it is optimal to spread requests across banks. If you have multiple accesses to the same bank, you get a bank conflict, which serializes the requests.

On compute capability 2.x, successive 32-bit words map to successive banks, and each bank is 32 bits wide. That's why each thread can request 4 bytes. Technically it can request more: 8-byte or 16-byte accesses map onto successive banks, and if no other thread is using those banks, the accesses are simultaneous. Wide accesses tend to cause many bank conflicts, so to avoid this, the implementation above uses 4-byte accesses.

You can also optimize this by introducing a stride length which is larger than 1, such that your bank accesses are disjoint.
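
A small sketch of the bank behavior described above, assuming the usual layout of 32 banks of 32-bit words (array sizes and names are assumptions):

```cuda
__global__ void bank_access_patterns(float* out) {
    __shared__ float data[32 * 32];

    // Fill the array cooperatively so the reads below have defined values.
    for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        data[i] = (float)i;
    __syncthreads();

    // Conflict-free: thread k of a warp reads word k, so each request lands
    // in a different bank and all 32 are serviced simultaneously.
    float a = data[threadIdx.x];

    // 32-way conflict: every thread reads a word whose index is a multiple
    // of 32, so all 32 requests land in bank 0 and are serialized.
    float b = data[(threadIdx.x * 32) % (32 * 32)];

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}
```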

xingdaz

For the declaration __shared__ float support, I am assuming the variable support is only initialized once. Is it initialized at compile time or at run time by the first thread? If it is the second case, how do the other threads know that the variable has already been initialized?

kayvonf

@xingdaz. __shared__ float support[] is a per-block allocation. There is one instance of this buffer per thread block, and if you wish, you can think of the allocation occurring at the time of block creation and being freed at the time of block termination. The initialization of the buffer's contents is performed by the CUDA threads in the block, according to the code you see above.
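
A minimal sketch of that point (array size and names are assumptions): each block gets its own instance of the __shared__ array, the threads of the block cooperatively write it, and the __syncthreads() barrier is what lets every thread know the buffer is fully initialized before anyone reads it.

```cuda
// Assumes a 128-thread block; input/output are sized to the grid.
__global__ void per_block_buffer(const float* input, float* output) {
    __shared__ float support[128];            // one instance per thread block

    int index = blockIdx.x * blockDim.x + threadIdx.x;
    support[threadIdx.x] = input[index];      // each thread initializes its own slot

    __syncthreads();                          // no reads until every thread has written

    // Now it is safe to read slots written by other threads in the same block.
    output[index] = support[(threadIdx.x + 1) % 128];
}
```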

PandaX

Is there an automatic caching mechanism on the GPU, like the one in a CPU?