Slide View : 15-418/618 Spring 2014

Performance Optimization II: Locality, Communication, & Contention

Previous | Next --- Slide 32 of 42

Back to Lecture Thumbnails

iamk_d___g

Question: is R/W counted as a conflict, or they are using different wires?

This comment was marked helpful 0 times.

mchoquet

Unless I'm missing something, all 32 threads do the same thing on each clock cycle, so that situation (some threads reading and some writing) can't arise.

This comment was marked helpful 0 times.

mchoquet

Wait, that wasn't quite right. If this memory is shared across the chip then all currently running warps could access it. However, I don't think they can access it during the same clock tick since we only have 1 batch of 32 load/store units that all the warps have to share. I could be wrong again though.

This comment was marked helpful 0 times.

kayvonf

@mchoquet. This is a GTX 480 specific example. You're referring to the GTX 680. For the sake of understanding this particular example of contention, consider a single banked on-chip memory (as illustrated above) and a single 32-wide warp trying to access it.

Shared memory address x is serviced by x mod 32. And each bank can service one request per cycle, unless all requests are to the exact same address (special case 1).

The overall point is that this is an example of contention playing a big role in the actual cost of an operation. 32 loads with no contention are completed in one cycle. If contention is present, the 32 loads could take much longer.

This comment was marked helpful 0 times.

black

I really like this example. There are too many situations so that NVIDIA can't handle them all. They decide to have the special case that if all threads access the same bank, and discard the situation that all threads access two banks. Maybe bulk load's overhead is too high. And [3*index] doesn't have the problem because 3 is not a factor of 32, we are just lucky.

This comment was marked helpful 0 times.