Slide 60 of 65
sfackler

Can GPUs actually do a 1-cycle multiply?

kayvonf

The short answer to your question is yes. In fact, it is common for GPUs to sustain a throughput of one multiply-add per clock on most of their execution units.

Modern Intel CPUs can also sustain a throughput of one multiply plus one add per clock per lane on the AVX units. There's one AVX unit that can perform an 8-wide 32-bit floating-point multiply per clock, and another that can perform an 8-wide add. Both instructions can be issued simultaneously.

Source: David Kanter's realworldtech.com article on Sandy Bridge, "Execution Units" section.
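To make the "one 8-wide multiply plus one 8-wide add per clock" point concrete, here's a minimal sketch (mine, not from the lecture or from Kanter's article) of a SAXPY-style loop written with AVX intrinsics in C; the function name and signature are made up for illustration, and it assumes n is a multiple of 8. On a Sandy Bridge-class core the multiply and the add are separate instructions that can issue to different execution ports in the same cycle:

#include <immintrin.h>

void saxpy_avx(float a, const float *x, float *y, int n) {
    __m256 va = _mm256_set1_ps(a);               /* broadcast a to all 8 lanes */
    for (int i = 0; i < n; i += 8) {
        __m256 vx   = _mm256_loadu_ps(&x[i]);    /* load 8 floats from x */
        __m256 vy   = _mm256_loadu_ps(&y[i]);    /* load 8 floats from y */
        __m256 prod = _mm256_mul_ps(va, vx);     /* 8-wide multiply (one execution port) */
        __m256 sum  = _mm256_add_ps(prod, vy);   /* 8-wide add (a different port) */
        _mm256_storeu_ps(&y[i], sum);            /* store 8 results back to y */
    }
}

Compile with something like gcc -mavx. (Later chips with FMA can fuse the multiply and add into a single instruction, but the throughput argument is the same.)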

The real answer to your question again comes back to latency vs. throughput. By breaking instruction execution into a number of stages and then pipelining those stages, it is possible to maintain a throughput of one instruction per clock even if the latency of performing that instruction is more than a clock.
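To make that concrete (with numbers made up for illustration): if a multiply has a latency of 4 clocks and is split into 4 pipeline stages, then n independent multiplies complete in 4 + (n - 1) clocks. As n grows, that approaches a throughput of one multiply per clock, even though each individual multiply still takes 4 clocks from start to finish.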

Pipelining itself is a form of exploiting instruction-level parallelism.

We'll address pipelining in a general sense later in the class; however, Hennessy and Patterson (see the suggested textbooks on the course info page) is the best place to get the full treatment of a basic pipelined processor. See Appendix C in the 5th edition, or Appendix A in earlier editions.

kfc9001

I guess I have another question, but probably a more graphics-related one: do most graphics algorithms really not touch memory that much? The algorithm above seemed fairly innocent, so I'm having a hard time believing graphics engines can really avoid the memory bus enough to get anywhere close to 100% efficiency (or even 25%).

kayvonf

The ratio of math operations to memory operations is sometimes called the arithmetic intensity of a program. 3D graphics applications tend to have extremely high arithmetic intensity compared to most other applications, thus it makes sense to build an architecture that has a high ratio of compute capability to memory bandwidth. But yes, one common challenge of writing programs for GPUs is to overcome the memory bandwidth bottleneck. (Keep in mind that GPU memory systems typically provide far more bandwidth than the main memory system attached to a CPU... they have to or the problem addressed here would be even more severe.)
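To put a rough number on it using the figures from this slide: the program here reads 12 bytes for every multiply it performs, so its arithmetic intensity is about 1 op per 12 bytes, or roughly 0.08 ops/byte. Keeping 480 ALUs fed at that intensity would take the multiple TB/sec of bandwidth computed further down this thread, which is why this particular program ends up bandwidth bound.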

There's a reason Intel doesn't fill their chips with execution units. They are targeting a different, and significantly broader class of workloads.

Xiao

Are there any studies on the average arithmetic intensity of common programs and benchmarks? This sounds like a very useful thing to consider when deciding how to spend the budget on a chip.

DanceWithDragon

In this case, how do we get the numbers 480 MULs per clock and 8.4 TB/s?

kayvonf

@DanceWithDragon: see if you can answer your question using the information about this GPU given on slides 54-57.

kayvonf

@Xiao: Check out this article about the Roofline model. For the record (and since I did my Ph.D. at Stanford), I'd like to add that, to the best of my knowledge, the term "arithmetic intensity" was coined at Stanford. The Berkeley folks erred in neglecting to make that citation here. Go Card! ;-)
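For anyone who hasn't seen it: the roofline model bounds a program's attainable performance as roughly min(peak compute throughput, arithmetic intensity * peak memory bandwidth). Plotting achieved performance against arithmetic intensity immediately shows whether a kernel is limited by compute or by memory bandwidth.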

xiaowend

I'm just wondering how to get 8.4 TB/s. I think each MUL requires 12 bytes per operation, there are 480 MULs per clock, and it can do 1.4G operations per second. So the data required per second should be 12 * 480 * 1.4G ≈ 7.9T. Can someone correct me?

aakashr

I got the same answer, @xiaowend. I think we are doing it right; the 8.4 is probably an error.

kayvonf

@xiaowend, @aakashr: You are correct; I believe I modified the clock rate slightly and didn't update the final answer. The correct math would be:

(12 bytes/op) * (480 ALUs) * (1.4 billion clocks/sec) = 8.064 * 10^12 bytes/sec, which is about 7,510 GB/sec (using 1 GB = 1024 * 1024 * 1024 bytes), or roughly 7.5 TB/sec.