okahn

What physical/architectural problems limit memory system bandwidth? Is bandwidth still increasing, or has it stalled the way clock speed has? Did it track clock speed historically?

scedarbaum

@okahn I was curious about this as well. This article provides some interesting stats on trends in memory and CPU performance, including memory bandwidth. The author notes roughly a 2x increase in theoretical CPU memory bandwidth and roughly a 3x increase in theoretical GPU memory bandwidth between 2007 and 2013, and predicts a similar increase over the next several years as DDR4 SDRAM hits the market. However, I am still curious whether there are any physical limitations on memory/chip hardware that would prevent further increases in the future (e.g., the way heat has proven a limiting factor on CPU clock speeds).

kayvonf

Great questions both of you. I'd like to defer answering this in detail until the lecture on "How memory works" (or we can talk in office hours).

In short:

  • You will certainly see bandwidth increase over time (the coming trend is based on 3D chip stacking), but...
  • It is always more difficult to move information than it is to operate on it locally. (This is true not just in computing; it's universal!)
  • You may be surprised at the relative energy cost of moving bits compared to performing arithmetic. A good rule of thumb these days is that a 64-bit integer op costs about 1 picojoule (pJ) and a double-precision floating-point op about 20 pJ (these are estimates of just the cost of the arithmetic, not the cost of moving bits out of the registers to the ALU, processing the instruction stream, etc.). Moving 64 bits from an L1 cache to the ALU costs an estimated 26 pJ, and moving 64 bits onto the chip from DRAM costs about 1200 pJ.

What does this mean? It means that on an ASIC (a custom circuit built to perform one computation, e.g., video decode or audio playback) you could perform as many as 1,200 integer math ops for the energy cost of a single double-word read from memory. Locality and reuse matter for efficient computing. A lot.
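To make the locality/reuse point concrete, here is a minimal C sketch (my own illustration, not from the lecture) contrasting a streaming kernel that does almost no math per byte moved with one that reuses each value many times while it sits in a register:

```c
// Illustrative sketch, assuming 4-byte floats and arrays far too large for cache.

// saxpy: 2 math ops per element against ~12 bytes of DRAM traffic
// (read x[i], read y[i], write y[i]) -- heavily bandwidth bound.
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// Reuse: ~40 math ops per element on a value held in a register, against
// only ~8 bytes of traffic (read x[i], write y[i]) -- closer to compute bound.
void poly(int n, const float* x, float* y) {
    for (int i = 0; i < n; i++) {
        float v = x[i];
        for (int k = 0; k < 20; k++)   // 20 iterations x 2 ops ~= 40 ops
            v = v * v + 0.5f;
        y[i] = v;
    }
}
```

Roughly speaking, the first loop asks memory for new data every couple of ops, while the second performs dozens of ops per word fetched, which is the direction the energy numbers above push you toward.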

ananyak

@kayvonf Interesting, so you're saying that even if we had infinite bandwidth, we would want to reduce bandwidth consumption to use less energy?

I'm curious how often bandwidth limitations come up in 'real programming'. The examples we considered in class and in the homework involved very simple computations. Does anyone know of any real programs that optimize for bandwidth consumption?

Also, are there good ways/tools to measure how much bandwidth your program is consuming? It seems like a non-trivial task for complex programs, and very dependent on cache sizes/structure and how the cache is shared across threads.

kayvonf

@ananyak: I suspect that without a bit of thought and optimization, you will be bandwidth limited in most parallel programs you write on a modern computer.

Tools for profiling your programs do exist, and the TAs might talk about a few in a recitation later in the course.
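For a rough do-it-yourself estimate on a simple kernel, you can time a streaming loop whose memory traffic you can count by hand and divide bytes moved by elapsed time; a result near the machine's peak memory bandwidth means you are bandwidth bound. A crude sketch of the idea (my own, not a course-provided tool), assuming a POSIX system and counting only the compulsory reads and writes:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Time a large streaming copy and divide the bytes it must move by the
// elapsed time; compare the result to your machine's peak DRAM bandwidth.
int main(void) {
    const size_t n = 1u << 26;                 // 64M floats, ~256 MB per array
    float* src = malloc(n * sizeof(float));
    float* dst = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) src[i] = (float)i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) dst[i] = src[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 2.0 * n * sizeof(float);    // n reads + n writes (ignores write-allocate traffic)
    printf("effective bandwidth: ~%.1f GB/s\n", bytes / sec / 1e9);

    free(src);
    free(dst);
    return 0;
}
```

Hardware performance counters (via the profiling tools mentioned above) give a much more precise picture for complex programs, where you can't count the traffic by hand.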