shared variables allow you to share memory among all the CUDA threads in a block on the GPU. The GPU has certain memory allocated per block and per thread.
mitraraman
All threads can read and write to a shared address space but there is also one instance of shared memory per thread. Threads belong to blocks, but we must also remember that each block is assigned to a core, so the threads run like SIMD instructions. However, to answer @kayvonf, I don't think that CUDA threads in different thread blocks can communicate through shared variables, which disproves your first statement that shared variables allow you to share memory between each block on the GPU.
kayvonf
@mitraraman: Are you sure about: "All threads can read and write to a shared address space but there is also one instance of shared memory per thread"? All CUDA threads can certainly read and write to/from global memory, which is indeed a single address space accessible to all threads created during a kernel launch. But this conversation is specifically about variables declared with the __shared__ keyword in CUDA. Is there an instance of these variables per thread? One per thread block? Or one per program? Can someone clear all this up?
briandecost
Attempting to clear up what CUDA shared memory is:
Slide 30 presents the CUDA memory model on the GPU nicely (and @Max summarizes it well).
In the CUDA device address space, there are 3 types of variables:
local to a thread
local to a block (on-chip, declared with the __shared__ keyword inside the kernel)
global (main memory, like the global arrays allocated in the host code with cudaMalloc())
So there's one instance of a __shared__ variable per thread block, stored in on-chip memory (so there's less latency when accessing it).
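As a rough sketch of those three kinds in one kernel (a hypothetical per-block sum reduction; the names blockSum, partial, etc. are mine, not from the slides):

```cuda
#include <cuda_runtime.h>

// `partial` has one instance per thread block (on-chip); `idx` is a
// per-thread local (lives in a register); `in` and `out` live in
// global memory allocated on the host with cudaMalloc().
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float partial[256];                    // per-block shared memory

    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread local
    partial[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                                  // all shared writes visible

    // Tree reduction within the block, entirely in shared memory
    // (assumes blockDim.x == 256, a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];                 // one result per block, to global
}
```

Threads in one block cooperate through `partial`, but a thread in a different block never sees this block's copy of it.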
bottie
I think we can say that, from a programmer's perspective, the resources each thread can access are:
1. r/w per-thread registers. (extremely high throughput.)
2. r/w per-block shared memory. (on-chip memory, so high throughput.)
3. r/w per-grid global memory. (something like main memory, so modest throughput.)
4. read-only per-grid constant memory. (cached, so also high throughput.)
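A minimal sketch that touches all four (the kernel `demo` and the constant array `coeff` are made-up names for illustration; assumes blockDim.x <= 128):

```cuda
#include <cuda_runtime.h>

__constant__ float coeff[4];                        // 4. per-grid, read-only, cached

__global__ void demo(const float *in, float *out, int n) {
    __shared__ float tile[128];                     // 2. per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 1. held in a per-thread register
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // 3. read from per-grid global
    __syncthreads();                                // every thread in the block syncs
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[i % 4];  // constant read, global write
}
```

The host would fill `coeff` with cudaMemcpyToSymbol(coeff, host_coeff, sizeof(coeff)) before launching, since the kernel can only read it.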
TeBoring
The answer is NO to both questions in the slide.
Blocks and threads are logical concepts in a CUDA program; how resources are allocated depends on the GPU's cores. If there are 8000 blocks and only 4 cores, there will be only 4 instances of the shared float array, and these four instances will be reused across the 8000 blocks.
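A host-side sketch of that reuse (the kernel `touch` is a made-up example): launching far more blocks than the GPU can run at once is normal, and the hardware recycles shared-memory storage as blocks retire.

```cuda
#include <cuda_runtime.h>

// Trivial kernel whose only point is that it declares a __shared__ array.
__global__ void touch(float *data) {
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = buf[threadIdx.x] + 1.0f;
}

int main() {
    const int blocks = 8000, threads = 256;
    float *d;
    cudaMalloc(&d, blocks * threads * sizeof(float));
    cudaMemset(d, 0, blocks * threads * sizeof(float));
    // All 8000 blocks are queued up front, but only as many run at once
    // as the hardware has resources for; each resident block gets its own
    // live copy of `buf`, and that storage is reused as blocks finish.
    touch<<<blocks, threads>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```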