I'm still confused about it. Can anyone explain why the first three operations each take a single cycle while the last one takes 16 cycles?
I think that's because the 32 threads in the warp map to only two banks (bank0 and bank16, as shown in the figure). So only two threads in the warp can access their banks at the same time, and the total time for all 32 threads to load their data is 32 / 2 = 16 cycles.
@ESINNG The first operation needs only one cycle because all threads request the same word in bank0, which the hardware can serve with a single broadcast.
The second and third operations each need one cycle because every thread requests a word from a different bank: all banks are busy, and each bank serves exactly one thread.
The fourth operation needs 16 cycles because banks 0 and 16 each receive requests from 16 threads. Since those 16 threads are requesting different words from the same bank, the requests are serialized, hence 16 cycles.
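The rule described above can be sketched as a toy simulation. This is not real GPU code: `cycles_for_access` is a hypothetical helper, and the model (32 banks, bank = word address mod 32, a broadcast of one identical word counting as a single access, cycles = the worst per-bank count of distinct words) is an assumption based on the discussion, not a hardware specification.

```python
# Toy model of the bank-conflict rule from the discussion (assumptions:
# 32 banks, bank = word_address % 32, identical words are broadcast).
def cycles_for_access(addresses, num_banks=32):
    """Cycles for one warp load: the maximum, over all banks,
    of the number of *distinct* words requested from that bank."""
    per_bank = {}
    for addr in addresses:
        per_bank.setdefault(addr % num_banks, set()).add(addr)
    return max(len(words) for words in per_bank.values())

tids = range(32)
print(cycles_for_access([0] * 32))                    # same word: broadcast -> 1
print(cycles_for_access(list(tids)))                  # stride-1: one bank each -> 1
print(cycles_for_access([t * 3 % 32 for t in tids]))  # odd stride: conflict-free -> 1
print(cycles_for_access([t * 16 for t in tids]))      # stride-16: 16 per bank -> 16
```

The stride-16 case lands 16 distinct words in each of banks 0 and 16, which is exactly the serialized fourth operation.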
@cgjdemo, @meatie Thank you for answering. Yes, the 32 threads make up a warp. Now it makes sense.
Question: How might you design the hardware differently to avoid a hidden performance "cliff" like this?
Maybe enlarge the bus width of each bank so it can serve 64 or 128 bits per clock cycle?
I believe that however we interleave, we could face this "cliff". Could multi-ported banks help?
@kayvonf: You could design the hardware to avoid a "cliff" by implementing a randomized address-to-bank assignment. Since memory accesses generally follow some pattern, it's very unlikely that you would encounter the worst case of completely serialized accesses in real programs. However, by losing the ability to predict how your accesses map to banks, you also lose the ability to tune your program for maximum performance. So this randomization technique can be thought of as decreasing performance variance, pulling in both the best and worst achievable performance in practice.
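To illustrate the idea, here is a toy comparison of a direct modulo mapping against a scrambled one. The `conflict_cycles` helper and the particular XOR hash are my own illustrative assumptions, not what any real GPU implements; the point is only that a scrambled mapping can break up a pathological stride.

```python
# Toy comparison: same 32-thread stride-16 access pattern under two
# different (hypothetical) address-to-bank mappings.
def conflict_cycles(addresses, bank_of):
    """Cycles for one warp load under a given bank mapping:
    worst per-bank count of distinct requested words."""
    buckets = {}
    for a in addresses:
        buckets.setdefault(bank_of(a), set()).add(a)
    return max(len(words) for words in buckets.values())

stride16 = [t * 16 for t in range(32)]
# Direct mapping: everything piles into banks 0 and 16.
print(conflict_cycles(stride16, lambda a: a % 32))               # -> 16
# Scrambled mapping (XOR-folding high bits into the bank index).
print(conflict_cycles(stride16, lambda a: (a ^ (a >> 5)) % 32))  # -> 1
```

With the scrambled mapping the stride-16 addresses spread across all 32 banks, so the worst-case pattern disappears; the trade-off, as noted above, is that the programmer can no longer reason about which pattern is fast.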