TeBoring
  1. If different threads access the same data at the same location in shared memory, shared memory can broadcast it to all of them.

  2. If different threads get data from different banks of shared memory, they can do it in parallel.

  3. If different threads access different data that lives in the same bank at the same time, there will be a conflict, called a bank conflict (because the unit of shared memory is called a bank). See the sketch below.
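A minimal CUDA sketch of the three cases (the kernel, the array name, and the one-warp block size are my own illustration, assuming the 32-bank layout discussed further down this thread):

    __global__ void bank_examples(float *out) {
        __shared__ float data[64];
        int tid = threadIdx.x;            // assume blockDim.x == 32 (one warp)
        data[tid] = tid;
        data[tid + 32] = tid + 32;
        __syncthreads();

        float a = data[0];        // case 1: one word for all threads -> broadcast
        float b = data[tid];      // case 2: 32 distinct banks -> fully parallel
        float c = data[tid * 2];  // case 3: banks 0, 2, ..., 30 each serve two
                                  //         different words -> 2-way bank conflict
        out[tid] = a + b + c;
    }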

lazyplus

Question: Is there a typo in the code of x3?

EDIT: This paragraph is deprecated due to a slide change. The code float x3 = A[index / 2] would generate accesses to addresses 0, 0, 1, 1, ..., 15, 15. Then each bank would serve the same value twice, which could be done in one cycle.

I think float x3 = A[index * 2] would generate accesses to addresses 0, 2, 4, ..., 32, 34, ..., 60, 62. Then 16 of the 32 banks would each serve two different values, which needs two cycles.
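One quick way to check that mapping (a host-side sketch; the 32-bank, one-word-per-bank layout is from the NVIDIA docs quoted below, the rest is mine):

    #include <stdio.h>

    int main(void) {
        // Thread tid reads 32-bit word (tid * 2), which lives in bank (tid * 2) % 32.
        for (int tid = 0; tid < 32; tid++)
            printf("thread %2d -> word %2d -> bank %2d\n", tid, tid * 2, (tid * 2) % 32);
        // The output shows banks 0, 2, ..., 30 each hit twice with two different
        // words: a 2-way conflict, hence two cycles.
        return 0;
    }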

kayvonf

@lazyplus. Your thinking is good. However, my understanding is that on the earliest CUDA-capable GPUs, two CUDA threads accessing the same value from one bank would still take two cycles. The bank is designed with the ability to give one value to one warp thread, or one value to all warp threads. It can't give the same value to just two threads in a cycle.

Since this is really confusing, I changed the slide to be much clearer. I eliminated the confusing example and went with one much like yours above, except I decided to multiply index by 16 to generate a very bad case that takes the hardware 16 cycles to service.

EDIT: @apodolsk below indicates my two-cycle explanation is not correct on most modern GPUs (instead, his explanation indicates the cost of an access is proportional to the maximum number of unique values that must be retrieved from any one bank). I'll have to look into whether I was just wrong, or if the status quo is an improvement on the earliest designs.
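A sketch of that worst case (the kernel name, array name, and sizes are assumptions):

    __global__ void sixteen_way_conflict(float *out) {
        __shared__ float A[512];
        for (int i = threadIdx.x; i < 512; i += 32) A[i] = i;
        __syncthreads();

        int index = threadIdx.x;   // one warp: 0..31
        // Word (index * 16) maps to bank (index * 16) % 32, which is always
        // bank 0 or bank 16. Sixteen threads hit each of those banks, each
        // wanting a different word, so the access is serialized into 16 cycles.
        out[index] = A[index * 16];
    }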

apodolsk

NVIDIA has some more nice diagrams:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-3-0

Compiling rules that are scattered around the doc somewhat haphazardly, I get this set of quotes for compute capabilities 2.0 and 3.0:

(A) "Shared memory has 32 banks."

(B) "Successive 32-bit words map to successive banks." This plus (A) implies bank = addr % 32, as in the slide above.

(C) "[Different] banks ... can be accessed simultaneously." This explains case 2 in the slide above.

(D) As a bonus, "a shared memory request does not generate a bank conflict between two threads that access ... the same 32-bit word". That is, any number of threads can read a single 32-bit word simultaneously, even though they're all accessing the same bank.
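For instance (an illustrative kernel, not from the docs):

    __global__ void broadcast_read(int *out) {
        __shared__ int flag[32];
        flag[threadIdx.x] = threadIdx.x;
        __syncthreads();
        // All 32 threads read the same 32-bit word: one bank, one word,
        // so per (D) this is a broadcast, not a 32-way conflict.
        out[threadIdx.x] = flag[5];
    }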

Note that this is different from example 4 on the slide and in @kayvonf's post above, since that example involves reading things from the same bank and also from different 32-bit words.

(I couldn't find where it says that all threads have to write to the same 32-bit word in order for this rule to apply, for GPUs of compute capability 2.0 and up. It looks like the lower-middle diagram in that link shows an instance of partial broadcast for 3.0, and they say it's "conflict-free".)

(E) As a superbonus on 3.0 GPUs, multiple threads can simultaneously read from "32-bit words whose indices i and j are in the same 64-word aligned segment and such that j=i+32".

That is, if arr is 64-word aligned (say, ((uintptr_t)arr % (64 * sizeof(int32_t))) == 0), then ((int32_t *)arr)[0] and ((int32_t *)arr)[32] can also be read at the same time.

The mental picture I get from this is that a single bank would look like (int32_t []){arr[0], arr[32], arr[64], ...}. So one worst-case access pattern would be arr[0], arr[64], arr[128], ... (It's not arr[0], arr[32], arr[64], because of rule E.)
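That cost model can be written down directly (a sketch of the rule in @kayvonf's EDIT above, ignoring the rule-E pairing, so it overestimates some 3.0 cases):

    #include <stdio.h>

    // Cost of a warp's shared-memory access, modeled as the maximum number of
    // DISTINCT 32-bit words requested from any single bank (same-word accesses
    // broadcast for free).
    int access_cycles(const int word_idx[32]) {
        int cycles = 1;
        for (int b = 0; b < 32; b++) {
            int seen[32], n = 0;
            for (int t = 0; t < 32; t++) {
                if (word_idx[t] % 32 != b) continue;
                int dup = 0;
                for (int k = 0; k < n; k++)
                    if (seen[k] == word_idx[t]) dup = 1;
                if (!dup) seen[n++] = word_idx[t];
            }
            if (n > cycles) cycles = n;
        }
        return cycles;
    }

    int main(void) {
        int idx[32];
        for (int t = 0; t < 32; t++) idx[t] = t * 64;   // arr[0], arr[64], ...
        printf("%d cycles\n", access_cycles(idx));      // prints 32
        return 0;
    }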