jpaulson

Block groups are an implementation-level detail.

There is block-level shared memory and some synchronization operations at the per-block level.

monster

Here both the blocks and the threads per block are arranged as 2-dimensional arrays. Each time you launch a kernel you specify the number of blocks and the number of threads per block; the total number of threads you use is the number of blocks times the number of threads per block.
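
A minimal sketch of such a 2D launch (the kernel name, matrix size, and block shape below are just illustrative choices, not from the slide). The total thread count is numBlocks.x * numBlocks.y * threadsPerBlock.x * threadsPerBlock.y:

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread handles one element of an N x N matrix.
    __global__ void matrixAddKernel(float* C, const float* A, const float* B, int N) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < N && col < N)
            C[row * N + col] = A[row * N + col] + B[row * N + col];
    }

    int main() {
        const int N = 1024;
        float *dA = nullptr, *dB = nullptr, *dC = nullptr;
        cudaMalloc(&dA, N * N * sizeof(float));
        cudaMalloc(&dB, N * N * sizeof(float));
        cudaMalloc(&dC, N * N * sizeof(float));
        // ... initialization of dA and dB omitted ...

        // 2D block of 16x16 = 256 threads; 2D grid sized (rounding up) to cover
        // the whole matrix, so total threads >= N * N.
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (N + threadsPerBlock.y - 1) / threadsPerBlock.y);

        matrixAddKernel<<<numBlocks, threadsPerBlock>>>(dC, dA, dB, N);
        cudaDeviceSynchronize();

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }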

Xelblade

threadIdx is akin to program index. blockDim is akin to program count. blockIdx is akin to task ID.
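
A small sketch of that mapping (assuming "program index / program count / task ID" refer to the ISPC-style abstractions from earlier in the course; the kernel itself is made up for illustration):

    // threadIdx.x  ~ program index  (which lane within this block)
    // blockDim.x   ~ program count  (how many lanes per block)
    // blockIdx.x   ~ task ID        (which block of the launch)
    __global__ void scaleKernel(float* data, float s, int N) {
        // Global index: each block covers a contiguous chunk of blockDim.x elements.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            data[i] *= s;
    }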

xiaowend

One advantage of keeping block-level memory is that this kind of shared memory is cheaper to access than off-chip global memory.

Tao

The shared memory is on-chip, so it is smaller but faster. The idea is similar to a cache in the memory hierarchy. Another benefit is that it allows the threads in a block to communicate via memory.

kayvonf

@Tao: Yes, you can think about the shared memory like a cache since data in shared memory is resident in low-latency, on-chip storage. The primary difference between shared memory and a traditional cache is that a program has to manually load data into shared memory (software manages what data is stored in the shared memory). In contrast, a cache is largely transparent to software: a program just issues loads and stores and the hardware manages what data is stored in a cache.
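
A minimal sketch of what "software manages what data is stored" looks like in practice (the 1D averaging kernel and its sizes are illustrative): each block explicitly copies the inputs it needs into a __shared__ array, synchronizes, and only then computes.

    // Illustrative 1D averaging kernel: each block explicitly stages the data
    // it needs into shared memory before computing, rather than relying on a
    // hardware cache to pull it in transparently.
    #define THREADS_PER_BLOCK 128

    __global__ void blurKernel(float* out, const float* in, int N) {
        // Software-managed on-chip storage: the block's chunk plus a 2-element halo.
        __shared__ float support[THREADS_PER_BLOCK + 2];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // Each thread loads one element; the first two threads also load the halo.
        support[threadIdx.x] = (gid < N) ? in[gid] : 0.0f;
        if (threadIdx.x < 2) {
            int haloIdx = gid + THREADS_PER_BLOCK;
            support[THREADS_PER_BLOCK + threadIdx.x] =
                (haloIdx < N) ? in[haloIdx] : 0.0f;
        }

        // All threads in the block must finish loading before any thread reads.
        __syncthreads();

        if (gid < N)
            out[gid] = (support[threadIdx.x] + support[threadIdx.x + 1] +
                        support[threadIdx.x + 2]) / 3.0f;
    }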

Another name for the storage used to implement shared memory is a "scratchpad".

If you read more closely about the details of an NVIDIA GPU, you'll find that each core has a fixed amount of addressable on-chip storage. This storage can be dynamically configured into a scratchpad region that is used for CUDA shared memory allocations and a part that functions as an L1 cache for off-chip global memory. On most recent NVIDIA GPUs (certainly true of the 480 and 6xx GPUs in the lab), there's 64 KB of storage per core, of which at least 16 KB must be reserved for shared memory.
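
The CUDA runtime lets a program express a preference for how that on-chip storage is split. A minimal sketch (the kernel here is only a placeholder):

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) data[i] += 1.0f;
    }

    void configureCachePreference() {
        // Ask the driver to favor a larger L1 cache (and a smaller shared
        // memory region) for this particular kernel. Other options include
        // cudaFuncCachePreferShared and cudaFuncCachePreferNone.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        // Or set a device-wide preference that applies to all kernels.
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    }

These are only hints: the hardware supports a limited set of split configurations, so the runtime picks the closest one it can honor.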