Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Previous | Next --- Slide 63 of 79

jpd

An interesting note -- modern HD 3D video games can nearly max out this type of GPU for compute when on high settings, especially if there's geometry tesselation and complex shaders. But there is still a big bottleneck to get that data to the GPU from the CPU. GPUs come with so much RAM built in because local RAM, as slow as it is compared to caches, is still much faster than sending data across a PCI bus to the GPU.

bdebebe

@jpd I'm a little lost on the last part you said. Local ram is faster than sending data across a PCI bus to the GPU (why would it send data across a PCI bus back to itself)?

hofstee

@bdebebe the GPU has its own bank of RAM, typically a few GBs of GDDR5 nowadays. The GPU is connected to the CPU with a PCIe x16 bus usually, and the system memory is DDR3/4 at something like a 240-pin bus.

Bandwidth from the GPU to its GDDR5 is ~177GB/s as mentioned on the slides earlier. Bandwidth from the CPU to its DDR3 is ~12GB/s (per stick of RAM)
Bandwidth from the CPU to the GPU is ~15 GB/s (gen 3 x16)

So as you can see, if you can store everything you need in the GPU's GDDR5, we can access memory at 177GB/s. When you need to fetch something not on the GPU, you now have this bottleneck from the PCIe x16 at 15GB/s

In a cache like an L1, data transfer rates can be in the TB/s range, way faster than the GDDR5 on a GPU, but at the tradeoff of speed is capacity, and for a GPU we operate on a lot of data, so the capacity is more important.

(I think he might have meant to the CPU in the last sentence.)

thomasts

Are there many other common applications that even come close to using all 23,000 pieces of data? We see a few slides later that element-wise multiplication of vectors, a task that seems highly parallelizable, can be very inefficient due to insufficient bandwidth. And it seems like any task that involves 23,000 pieces of data would necessarily involve communicating with the cache, i.e. would necessarily require high bandwidth.

jellybean

For anyone wondering how 23,000 pieces of data was achieved, it's 32 pieces of data per warp, 48 simultaneously interleaved warps per core, and 15 cores per chip. 32 * 48 * 15 ~= 23,000.