Slide 62 of 81
crow

I think I'm misunderstanding something, because the math is not adding up for me: 2560 ALUs * 1.6*10^9 cycles/sec = 4.1 TFLOPS. Where does the missing factor of 2 come from?

Qiaoyu

@crow, I think it may be because a SIMD mul-add ALU loads two numbers for every operation.

crow

but loading two numbers still results in just one operation!

i did some reading on wikipedia and i think it might be due to the fact that each ALU can do a fused multiply-add (FMA) instruction, which is counted as two operations instead of one. but i think it is kind of cheating to count that as two ops, because i would not say it is something the programmer really has control over (as opposed to work balancing, paying attention to cache/locality, and other things which may prevent one from getting peak performance).

consider summing up an array of numbers. how is FMA going to help there? if this operation cannot even be used by an intelligent programmer to speed up a trivial computation like this, then it should not be used to compute the "peak performance" of the gpu.

of course, maybe FMA can be used there, and i just don't know about it? this is possible.
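
To make the array-sum question concrete, here is a rough sketch in Python. The helper `fma` and the counting functions are hypothetical names for illustration only, not anything from the lecture. A sum can be issued on a mul-add unit by multiplying by 1.0, but then each issued instruction does only one useful operation, whereas a dot product gets two useful operations per issued instruction.

```python
def fma(a, b, c):
    # Hypothetical stand-in for a hardware fused multiply-add.
    # Real FMA hardware rounds once; this Python version rounds twice.
    return a * b + c

def array_sum(xs):
    # Sum issued on a mul-add unit: the multiply by 1.0 is wasted work.
    acc, issued = 0.0, 0
    for x in xs:
        acc = fma(x, 1.0, acc)
        issued += 1
    useful_flops = len(xs)        # one useful add per element
    return acc, issued, useful_flops

def dot(xs, ys):
    # Dot product: both the multiply and the add are useful.
    acc, issued = 0.0, 0
    for x, y in zip(xs, ys):
        acc = fma(x, y, acc)
        issued += 1
    useful_flops = 2 * len(xs)    # one multiply and one add per element
    return acc, issued, useful_flops

def fraction_of_peak(issued, useful_flops):
    # Peak is counted as 2 floating-point operations per issued mul-add.
    return useful_flops / (2 * issued)
```

Under this counting, `array_sum` reaches at most half of the quoted peak even with perfect scheduling, while `dot` can in principle reach all of it.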

kayvonf

@crow. When counting FLOPS (floating-point operations per second), it is customary to count the multiply and the add performed by a muladd unit as two operations.
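
As a quick check of the arithmetic in this thread, here is a small Python sketch. The function name `peak_flops` and its parameters are made up for illustration; the ALU count and clock rate are the figures crow quoted above. Counting one operation per ALU per cycle gives crow's number, and counting a muladd as two gives double that, which would account for the factor of 2.

```python
def peak_flops(num_alus, clock_hz, flops_per_alu_per_cycle):
    # Peak rate = ALUs * cycles per second * operations counted per ALU per cycle.
    return num_alus * clock_hz * flops_per_alu_per_cycle

# Figures quoted in the thread: 2560 ALUs at 1.6 GHz.
one_op_per_cycle = peak_flops(2560, 1.6e9, 1)   # muladd counted as one operation
two_ops_per_cycle = peak_flops(2560, 1.6e9, 2)  # muladd counted as multiply + add

print(one_op_per_cycle / 1e12, "TFLOPS")
print(two_ops_per_cycle / 1e12, "TFLOPS")
```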