retterermoore

Is it a correct interpretation that this is a kind of "neutral" program, in that if you had two programs to run, you would run this one on the GPU if the other program was more efficient on the CPU, and vice versa?

Also, it doesn't seem like there'd be any way, for this particular problem, to trade increased math operations for decreased memory traffic (a trade-off that could be used to make some other programs more parallel). Is this accurate?

smklein

We were asked the following question in class - "Would this program run 'better' on a CPU or a GPU?"

One might think that, because a GPU has many cores, multiple ALUs per core, and a nice setup for SIMD execution, it would be superior to a CPU here.

A GPU would be faster than a CPU for this example, but not primarily because of the high core/ALU count. A GPU would be faster because it has a wider bus, and therefore higher bandwidth.

GPUs are capable of doing a lot of work in parallel very quickly, but if our only operation is an itty-bitty MUL, the GPU spends most of its time just sitting around waiting for the bus to bring A and B in and to take C away.

With the example of the NVIDIA GTX 480 GPU, if we could magically double the number of wires connected to main memory while keeping the exact same number of cores/ALUs, we would potentially halve the execution time of this program.
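For concreteness, here is a minimal sketch of the kind of element-wise multiply being discussed (the slide's exact code isn't reproduced in this thread, so `vecMul`, `N`, and the launch configuration are illustrative assumptions). Each output element requires two 4-byte loads and one 4-byte store but only a single multiply, which is why memory bandwidth, not ALU count, is the bottleneck:

```cpp
// Hypothetical sketch of the bandwidth-bound program under discussion:
// C[i] = A[i] * B[i]. Per element: 8 bytes read + 4 bytes written, 1 multiply.
#include <cuda_runtime.h>

__global__ void vecMul(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * B[i];   // one tiny MUL per 12 bytes of memory traffic
}

void launchVecMul(const float* dA, const float* dB, float* dC, int N) {
    int threadsPerBlock = 256;   // illustrative choice, not from the slide
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecMul<<<blocks, threadsPerBlock>>>(dA, dB, dC, N);
}
```

At roughly 12 bytes of traffic per multiply, the ALUs can never be kept busy, and the kernel's runtime is essentially (total bytes moved) / (memory bandwidth).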

taegyunk

I don't quite understand why the GPU is more efficient than the CPU in the above context. Would somebody explain the phrase "~2% efficiency... but 8x faster than CPU!"?

jmnash

One thing I was thinking about that wasn't mentioned in class is that the CPU has a much larger cache than the GPU does. This program seems like it would use a cache well, since the array elements are stored next to each other in memory. So even if the CPU is slower overall, would fetching A and B and storing C take less time on the CPU because of its larger cache? Or does that not really matter because the GPU has higher bandwidth?

yihuaf

@taegyunk I don't think this slide suggests a conclusion about whether the GPU or the CPU is more efficient. All this slide states is that both the GPU and the CPU are inefficient here due to the memory bandwidth constraint.

Efficiency measures how well the program utilizes all the existing cores. Due to the memory bandwidth constraint, the GPU is only capable of achieving ~2% efficiency. The GPU may still perform the same operations faster than the CPU can, simply because the GPU has more cores and this problem is ridiculously parallelizable. Therefore, the overall performance of the GPU is ~8x faster.
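As a rough back-of-the-envelope check of the ~2% figure (assuming the GTX 480's ~480 ALUs at a ~1.4 GHz shader clock, the 177 GB/s bandwidth quoted below, and ~12 bytes of memory traffic per multiply; the exact peak figure on the slide may differ):

$$ \text{peak} \approx 480 \times 1.4\,\text{GHz} \approx 672\ \text{Gmul/s}, \qquad \text{achievable} \approx \frac{177\ \text{GB/s}}{12\ \text{B/mul}} \approx 15\ \text{Gmul/s}, \qquad \frac{15}{672} \approx 2\% $$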

yihuaf

@jmnash Be careful with the assumption you are making in that statement. It is true that a CPU has a larger cache than a GPU, but will that fact significantly influence the performance of this program?

First of all, a write miss usually does not penalize performance (depending on the cache's write policy), since memory writes can be queued and the processing unit (CPU or GPU) can continue operating without stalling. In this problem, however, the issue is that memory bandwidth does not allow the processing units to perform more operations. Assume A and B are two different arrays with no overlapping locations and that none of their elements start out in the cache (a cold cache); then each element of each array needs to be fetched from memory. The memory reads are no faster, so we still do not improve performance or efficiency.

If both A and B fit into the CPU's cache but not into the GPU's cache, and we start with A and B already in the cache, we may observe an improvement in performance and efficiency on the CPU.
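One simple way to see this distinction (a hypothetical host-side experiment, not from the slide): time the same multiply loop twice. The first, cold-cache pass is limited by memory bandwidth; if the arrays are small enough to stay resident in the CPU's cache, the second pass can run noticeably faster, whereas arrays far larger than the cache show no such benefit.

```cpp
// Hypothetical warm-vs-cold-cache timing experiment on the CPU side.
// If 3 * N * sizeof(float) fits in the last-level cache, pass 1 should be
// faster than pass 0; for huge arrays both passes stay bandwidth-bound.
#include <chrono>
#include <cstdio>
#include <vector>

static void multiply(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] * B[i];
}

int main() {
    const int N = 1 << 16;  // small enough to be cache-resident (assumption)
    std::vector<float> A(N, 1.0f), B(N, 2.0f), C(N);

    for (int pass = 0; pass < 2; pass++) {
        auto t0 = std::chrono::steady_clock::now();
        multiply(A.data(), B.data(), C.data(), N);
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("pass %d: %lld ns\n", pass, ns);
    }
    std::printf("C[0] = %f\n", C[0]);  // keep the loop from being optimized away
    return 0;
}
```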

mofarrel

@yihuaf @taegyunk The reason that the GPU is 8x faster in this case is simply that it has a wider bus.
Compare the memory bandwidths of the GPU and CPU: $$ \frac{177\ \text{GB/s}}{21\ \text{GB/s}} \approx 8 $$ Both the CPU and the GPU are limited by memory bandwidth in this example.

uhkiv

Is $$\approx 177\ \text{GB/s}$$ the state of the art in bus speeds right now? Or is it that we CAN make a faster bus, but the applications we use don't really call for it?

Interesting question: when performing a task like the one above, when is it better to use a distributed systems approach rather than a parallel computing approach? I think if we are cost-constrained, it would be better to use a distributed systems approach, since (I think) it is cheaper to use multiple cheap machines than a supercomputer.