Slide 35 of 54
akashr

Shared variables allow you to share memory among all the CUDA threads in a block on the GPU. The GPU has a certain amount of memory allocated per block and per thread.

mitraraman

All threads can read and write to a shared address space but there is also one instance of shared memory per thread. Threads belong to blocks, but we must also remember that each block is assigned to a core, so the threads run like SIMD instructions. However, to answer @kayvonf, I don't think that CUDA threads in different thread blocks can communicate through shared variables, which disproves your first statement that shared variables allow you to share memory between blocks on the GPU.

kayvonf

@mitraraman: Are you sure about "All threads can read and write to a shared address space but there is also one instance of shared memory per thread"? All CUDA threads can certainly read and write to/from global memory, which is indeed a single address space accessible to all threads created during a kernel launch. But this conversation is specifically about variables declared with the __shared__ keyword in CUDA. Is there an instance of these variables per thread? One per thread block? Or per program? Can someone clear all this up?

briandecost

Attempting to clear up what CUDA shared memory is:

Slide 30 presents the CUDA memory model on the GPU nicely (and @Max summarizes it well).

In the CUDA device address space, there are 3 types of variables:

  • local to a thread
  • local to a block (on-chip, declared with __shared__ keyword inside the kernel)
  • global (main memory, like the global arrays allocated in the host code with cudaMalloc())

So there's one instance of each shared variable per thread block, stored in on-chip memory (so there's less latency when accessing it).
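To make that concrete, here's a minimal sketch of a kernel with a __shared__ array (hypothetical code loosely modeled on a 1D-averaging example, not the exact code from the slide). Every thread block that runs this kernel gets its own instance of support, and all threads in that block cooperate through it:

    #define THREADS_PER_BLK 128

    // Assumes N == gridDim.x * blockDim.x and that input holds N + 2 elements.
    __global__ void convolve(int N, float* input, float* output) {
        // One instance of this array per thread block, kept in on-chip memory.
        __shared__ float support[THREADS_PER_BLK + 2];

        int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Threads in the block cooperatively load the block's input window.
        support[threadIdx.x] = input[index];
        if (threadIdx.x < 2)
            support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

        // Barrier: wait until every thread in this block has finished loading.
        __syncthreads();

        // Each thread reads its neighbors from the per-block shared array.
        float result = 0.0f;              // 'result' lives in a per-thread register
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];

        output[index] = result / 3.f;     // write to per-grid global memory
    }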

bottie

I think we can say that, from a programmer's perspective, the resources each thread can access are:

1. R/W per-thread registers (extremely high throughput).
2. R/W per-block shared memory (on-chip memory, so high throughput).
3. R/W per-grid global memory (something like main memory, so modest throughput).
4. Read-only per-grid constant memory (cached, so also high throughput).
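Here's a small sketch (hypothetical names, not from the slide) showing where each of these appears in code; the kernel above already exercises 1-3, and constant memory is declared at file scope and filled from the host:

    // 4. Per-grid constant memory: read-only inside kernels, served by a cache.
    __constant__ float coeffs[2];

    __global__ void weightedAvg(int N, const float* input, float* output) {
        __shared__ float tile[128];                       // 2. per-block shared memory
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // 1. 'idx' sits in a per-thread register

        if (idx < N)
            tile[threadIdx.x] = input[idx];               // 3. read from per-grid global memory
        __syncthreads();

        if (idx < N) {
            float r = coeffs[0] * tile[threadIdx.x];
            if (threadIdx.x + 1 < 128 && idx + 1 < N)
                r += coeffs[1] * tile[threadIdx.x + 1];   // neighbor's value via shared memory
            output[idx] = r;                              // 3. write to per-grid global memory
        }
    }

    // Host side, before launching: cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));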

TeBoring

The answer is NO to both questions in the slide.

Blocks and threads are logical concepts in a CUDA program. How resources actually get allocated depends on the GPU's cores. If there are 8000 blocks and only 4 cores, there will be only 4 instances of the shared float array at any one time; these four instances are reused across the 8000 blocks.
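As a hedged illustration of that point, reusing the hypothetical convolve kernel from above (d_input and d_output are assumed device arrays sized for 8000 * 128 threads plus the 2-element halo):

    // 8000 logical thread blocks of 128 threads each. The __shared__ array in
    // convolve is NOT allocated 8000 times up front: physical instances exist
    // only for the blocks currently resident on the GPU's cores, and that
    // on-chip storage is reused as blocks finish and new blocks are scheduled.
    convolve<<<8000, THREADS_PER_BLK>>>(8000 * THREADS_PER_BLK, d_input, d_output);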