Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2017

Performance Optimization II: Locality, Communication, & Contention

shiyqw

Can anyone explain why the three accesses cost one, one, and 16 cycles, respectively?

Penguin

The first two take only one clock cycle because each index falls in a different bank of shared memory, so all the values can be read at once. In the last one, however, half of the indices fall in one bank and half in another. Since each bank can deliver only one value per clock, each of the two banks has to be read 16 times to get the 32 values you want, which takes 16 cycles.

200

Each bank can provide only one value per cycle. Therefore, if multiple threads request values from the same bank concurrently, the requests queue up because of contention.

locked

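Shared memory is divided into banks, and the bank number of a word is (word address) % (number of banks), where the address counts 4-byte words. There are 32 banks, the same as the number of threads in a warp. Each bank can perform only one memory load per cycle, so we need to be careful about the access pattern when the threads of a warp read shared memory.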
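To make the bank arithmetic concrete, here is a minimal sketch of a conflict model. It is not a measurement and not how the hardware is implemented. It assumes 32 banks that are each one 4-byte word wide, a warp of 32 threads in which thread i reads word stride * i, and that threads reading the very same word are served together (broadcast). The names bank_of and estimated_cycles are made up for this illustration.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide
WARP_SIZE = 32  # threads per warp


def bank_of(word_index, num_banks=NUM_BANKS):
    """Bank that holds the 4-byte word at this index of a shared array."""
    return word_index % num_banks


def estimated_cycles(stride, warp_size=WARP_SIZE, num_banks=NUM_BANKS):
    """Estimate cycles for one warp-wide load where thread i reads word stride * i.

    Model: each bank delivers one distinct word per cycle, so the cost is the
    largest number of distinct words that any single bank has to deliver.
    """
    # A set keeps each word once: threads reading the same word share one read
    # (assumed broadcast behaviour).
    words = {stride * i for i in range(warp_size)}
    words_per_bank = Counter(bank_of(w, num_banks) for w in words)
    return max(words_per_bank.values())


if __name__ == "__main__":
    for stride in (0, 1, 3, 16):
        print(f"stride {stride:2d}: {estimated_cycles(stride)} cycle(s)")
```

Under these assumptions, a stride of 16 sends the 32 threads to banks 0 and 16 only, 16 distinct words each, which gives the 16 cycles discussed above. Strides of 0, 1, and 3 each come out to a single cycle in this model.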

kayvonf

Question: I'd really like to see someone run this experiment on the GTX 1080s in the lab and report the results.