Slide 61 of 70
haboric

Can anyone explain why x2 and x3 cost a single cycle, but x4 costs 16 cycles?

bojianh

Conflict miss?

randomthread

@haboric The goal is to load a single float for each of the 32 threads, with the constraint that each bank can deliver only 1 word (one float) per clock. With the indexing used for x2 and x3, the first 32 threads load from 32 different banks, so all the floats for the 32 simultaneously executing threads can be loaded in one clock (a single load from each of the 32 banks). With x4, in contrast, the 32 threads index only 2 unique banks. At 1 word per bank per clock, that requires 32/2 = 16 clocks to load the data for the 32 threads (16 sequential loads from each of the 2 banks).
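
To check the arithmetic above, here is a rough sketch of the bank model in Python. It is only a simplified model, not how the hardware is actually implemented: it assumes 32 banks that are each one 4-byte word wide, with consecutive words mapped to consecutive banks. The strides 1, 3, and 16 are my assumption about the indexing on the slide (they give 32, 32, and 2 unique banks respectively), and `cycles_for_stride` is a made-up helper name.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumed: 32 banks, each one 4-byte word wide
WARP_SIZE = 32   # 32 threads issue the load together


def cycles_for_stride(stride):
    """Clocks for 32 threads to load x[stride * thread_id] in this simplified model."""
    words_per_bank = defaultdict(set)
    for thread_id in range(WARP_SIZE):
        word_index = stride * thread_id
        # Assumed mapping: consecutive words live in consecutive banks.
        words_per_bank[word_index % NUM_BANKS].add(word_index)
    # Each bank delivers one word per clock, so the busiest bank sets the cost.
    return max(len(words) for words in words_per_bank.values())


for stride in (1, 3, 16):  # assumed strides, chosen to match the bank counts described
    print(f"stride {stride}: {cycles_for_stride(stride)} clock(s)")
```

Under these assumptions, any odd stride touches all 32 banks exactly once, while stride 16 alternates between just two banks, so each of those banks has to serve 16 different words one after another.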

KnightsLanding

Thank you, @randomthread!

If I understand correctly, 1 float will span 4 memory banks? There are 32 banks, and each can serve one word of data to the warp per clock. That's 32 floats per clock.

I found an article about the internals of SDRAM: http://www.anandtech.com/show/3851/everything-you-always-wanted-to-know-about-sdram-memory-but-were-afraid-to-ask/2