Can anyone explain why the three commands cost single, single, and 16 cycles?
Penguin
The first 2 only take one clock cycle because each of the indices fall on a different bank of the shared memory so you can instantly read the value. However, the last one has half of them in one bank and half in another, so you have to read from two banks 16 times each to get the 32 values you want since you can only read one value per clock.
200
Each bank can only provide one value per cycle, therefore, if multiple threads are requesting values from a same bank concurrently, these requests would queue up due to the contention.
locked
Shared memory address is stored in banks and the bank number is (memory address)%(number of threads in a warp). Each bank can only perform one memory load in a cycle so we need to be careful when we access shared memory in threads.
kayvonf
Question: I'd really like to see someone run this experiment on the GTX 1080's in the lab and report your results.
Can anyone explain why the three commands cost single, single, and 16 cycles?
The first 2 only take one clock cycle because each of the indices fall on a different bank of the shared memory so you can instantly read the value. However, the last one has half of them in one bank and half in another, so you have to read from two banks 16 times each to get the 32 values you want since you can only read one value per clock.
Each bank can only provide one value per cycle, therefore, if multiple threads are requesting values from a same bank concurrently, these requests would queue up due to the contention.
Shared memory address is stored in banks and the bank number is (memory address)%(number of threads in a warp). Each bank can only perform one memory load in a cycle so we need to be careful when we access shared memory in threads.
Question: I'd really like to see someone run this experiment on the GTX 1080's in the lab and report your results.