Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

yikesaiting

I have a question: Why do we design such good execute units while keep the bandwidth a bottleneck?

One possible explanation I thought is that sometimes we will compute data within GPU(stored in cache maybe?) for several times before getting the results. That will leverage the hardware better.

bullseye

How was the 6.4 TB/sec of bandwidth to keep ALUs busy calculated? With 480 multiplications per clock and 12 bytes for each multiplication at 1.2 GHz, the bandwidth calculation comes out to 480 * 12 * 1.2 = 6.9 TB/sec of bandwidth. Am I not accounting for some variable?

misaka-10032

GTX480 has 15 cores and 32 ALUs each core, so totally (15*32=) 480 ALUs. When we make full use of ALUs, we can have 480 instructions run per cycle. As Frequency is 1.2 GHz, we can have (480 * 1.2 * 10^9 =) 5.76e11 multiplications per second. Each multiplication involves (3 int's =) 12 bytes, so total memory bandwidth needs to be (5.76e11 * 12 * 1024^-3=) 6.4 TB/s in order to support that many multiplications. Here we only have 177GB/s, so efficiency is only about (177 / 6400 =) 2.7%.

patrickbot

How could we theoretically get around the bandwidth bottleneck? One thing we discussed was using caches or using a better memory bus (?).

If we wanted to build a CPU optimized for this specific task of pointwise multiplication and we want to achieve high efficiency, what could we do (if we had a lot or infinite money)?

doodooloo

Since the ALUs in this processor run twice per clock as mentioned in previous slides, should there be 960 MULs per clock at most?

hofstee

I think the 480 MULs per clock is referring to the 32 wide warps, where there are 16 wide MULs at double clock rate working in the ALU for this operation.

doodooloo

@hofstee we have 32 wide warps per clock for one set of 16 ALUs, but in NVIDIA GTX 480, we have two sets of 16 ALUs per core. Thus the question (32 warps/set * 2 set/core * 15 core = 960 warps).

hofstee

Oh I see what you mean. I think it would be 480 MULs per clock (1 instruction per ALU), but 960 elements multiplied (unless we can't process a warp in 1 cycle).