Can anyone give me an explanation on why this is only 2% efficiency? We kinda burned through the last few lecture slides and I would like some clarification on this.
I guess the reason is that the memory bandwidth of GTX 480 GPU is only 177GB/sec. However, to keep all functional units of the GPU busy, we need to feed data very quickly to them, which need 7.5 TB/sec of bandwidth. So most of time, the functional units are idle since we cannot feed data quickly enough. So, the efficiency is only (177GB/sec)/(7.5TB/sec) = 2%.
In slide 37 we talked about the NVIDIA GTX 480 GPU as an example of a processor with 15 cores and 32 ALUs per core. That means there's 15*32 = 480 units on the chip that can perform a multiply per clock. The processor is clocked at 1.4GHz, so it can perform 1.4 * 10^9 * 480 = 672B multiplies per second.
15*32 = 480
1.4 * 10^9 * 480 = 672B
To perform all those multiplies, the processor needs to read the inputs to the operations from memory (A[i] and B[i]), and store the output to memory (C[i]). That means for each multiply operation, two floats are read, and one is written (a total of 3 * 4 = 12 bytes). So, in order to perform 672 billion multiplies per second, the processor needs to be able to read/write memory at a rate of (672 * 10^9 * 12) / (1024 * 1024 * 1024) = 7,500 GB/sec, that's 7.5TB/sec of bandwidth needed to feed all those ALUs!
3 * 4 = 12
(672 * 10^9 * 12) / (1024 * 1024 * 1024)
This GPU is actually hooked up to a memory system that can provide 177GB/sec of bandwidth, that's only about 2% of the required rate. As a result, for this computation, even assuming a best-case scenario where memory latency is perfectly hidden via multi-threading, the ALUs will be performing math about 50X slower than they actually can. Simply put, they're waiting on memory most of the time.
Another way to look at things is that for this specific computation, if we continue to assume all memory access latencies are hidden, a single-core version of this GPU would perform just as well on this computation as the 15-core version. In fact, it would STILL be bandwidth limited.
Perhaps we're not going to talk about this until later (or maybe not much at all), but I'm interested to know if there are any approaches commonly used to hide the need for high bandwidth to any extent. Specifically in this case, is there any way to get better efficiency than what's mentioned here to carry out this multiplication? My intuition says that there isn't -- we have to get some amount of data from memory, and that's going to be our limiting factor -- but I'm interested to see if there's some cute trick I'm missing.
Nope, there's no big trick to solve the problem. Caching doesn't help in this example because every piece of data is touches exactly once. I do list a few general tips for improving the arithmetic intensity of programs later in the lecture.
There is also the option of reducing bandwidth requirements by applying some form of compression for data stored in main memory. A hot area of computer architecture research at the moment is hardware supported compression, and Prof. Mutlu and Prof. Mowry at CMU have done some great work in this area. For example, you could take a look at this paper:
Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework
Although not applicable to this particular application since it compresses data in cache and not data in main memory, you might find this paper a good read, it is quite accessible:
"Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
You can browse Gennady Pekhimenko's home page for more.
It seem to me that there is no way to make the GPU core 100% efficient due to the bottleneck of memory bandwidth. If that is the case, why is it necessary for us to have that many compute units anyway ?
@jing I think that in this example, the reason that the computation is so inefficient is that we are doing too simple an operation. Since multiplying is really easy, the bottleneck of the operation is in loading the data in A[i] and B[i].
If we did something more interesting than multiplying, for example approximated log[sin(A[i])+cos(B[i])] to 100 decimal places using a taylor series, we would come much closer to fully using our computational units. Here, we still only have to fetch two things from main memory: A[i] and B[i], but instead of just multiplying them, we spend many cycles computing log[sin(A[i])+cos(B[i])]. the result for each pair of elements.
Perhaps we can only do 10 of these complicated computations per second instead of 460. We would then need only about 160GB/s of data, which we are allowed.
So, this just seems to be a case of that speed vs efficiency concept earlier then. It's good that the code is running fast, but we're wasting a lot of resources to get it done. Given that the trade-off is always a case-by-case basis, I wonder if anyone has tried their best to document any guidelines on which way to go? But I suppose that there's really no substitute for experience and research. Either way, it seems that finding ways to more easily balance that trade-off would be an interesting subject.
Hey, I'm a little confused about this calculation in regards to which clock speeds we're talking about. Is the 1.4 GHz clock speed referring to the ALU clock speed or GPU Clock speed?
On the slide in lecture 2, slide 57 it mentions that the ALU clock speed runs at twice the rate of the processor clock speed. If the (1.4 GHz) figure refers to the GPU clock speed, then I think it should be 960 MULs per clock, and the end figure should be 15 TB/s ?
I tried to verify which clock speed was being referred to, but the specs for the 480 don't seem to match what is online: http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-480/specifications . Did we use sample numbers for this problem, or have the specs for this component changed that dramatically?
@andymochi. Correct. I'm talking about ALU clock.
Interesting, I said 1.4GHz, that spec sheet says 1.2GHz. (apparently I was wrong. Thanks for the pointer.) This obviously doesn't change the point of the slide though, as the chip is still dramatically memory bound.
@kayvonf I vaguely remember that caches load data up in lines, not single elements so it should not matter that we touch each piece of data once because if we don't evict lines after every multiply we are reading off future elements from the lines we loaded after that first touch. Does it change the bandwidth needed in a meaningful way?
@Berry It doesn't really change the bandwidth calculation, since you have to move more data across the memory bus every time you access an element and miss the cache. If you load fewer elements each time you miss the cache then you can effectively use more GPU threads and vice versa if you load more elements on missing the cache.
@Berry: This computation must read all elements in the input arrays once, and write all output elements in the output array once. The presence of the cache only means that multiple elements (32 of them, in a single 128-byte cache line) are transferred from memory to the processor at once. The fact that after reading A[i], then A[i+1] is also in the cache is only relevant to a discussion of memory latency. Here we assumed latency was perfectly hidden (e.g., by a hardware pre-fetcher or by hardware multi-threading) and so the discussion is only concerned with having enough bandwidth to provide data at rate that equals the rate the processor can consume it.
Exercise for everyone: if you can't explain to yourself why the thought experiment on this slide holds regardless of whether there is a cache in the system or not, you should definitely chat it over with a friend or the TAs.