mchoquet
I'd like to try to summarize the architecture of this GPU:
The GPU is composed of some number of cores (SMX units), with a shared cache.
Each SMX unit has:
A shared cache, which will be split into <= 16 thread block contexts (fewer if the 48K of shared memory isn't enough for 16 blocks; see the example just after this list).
Space for up to 64 warp contexts, where a warp represents something that's running/trying to run on the core right now. There's 256K allocated for warp contexts, so if the contexts are too big there won't be space for all 64.
Four warp selectors, each of which can decode and issue <= 2 instructions from its warp's instruction stream at the same time.
Six blocks of 32 general-purpose ALUs, and two blocks of special ALUs (one for math and one for loads/stores). This means the core supports 32-wide SIMD vector computation.
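As a concrete example of the block-context split (the 6 KB per-block figure here is made up for illustration): a kernel whose blocks each declare 6 KB of shared memory would get min(16, 48 KB / 6 KB) = 8 block contexts per SMX; all 16 contexts fit only when each block needs 3 KB or less.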
Putting this together, we can figure out the following:
The SIMD width is 32, so each warp cannot represent more than 32 CUDA threads (see the CUDA sketch after this list).
CUDA specifies that all the threads of a block must be runnable on a single core, so a single block can't be split into more than 64 warps, or 64 x 32 = 2048 threads.
A maximum of 2048 threads per core means the 8-core chip can support at most 8 x 2048 = 16384 threads at once.
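To make the thread-to-warp mapping concrete, here's a minimal CUDA sketch (the kernel and the launch sizes are made up for illustration): launching 64 blocks of 256 threads gives 256 / 32 = 8 warps per block and 64 x 256 = 16384 threads total, i.e. exactly the chip-wide maximum above.

    #include <cuda_runtime.h>

    // Hypothetical kernel: each CUDA thread scales one element. The
    // hardware packs consecutive thread indices into 32-wide warps.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main() {
        const int threadsPerBlock = 256;  // 256 / 32 = 8 warps per block
        const int numBlocks = 64;         // 64 * 256 = 16384 threads total
        const int n = numBlocks * threadsPerBlock;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        scale<<<numBlocks, threadsPerBlock>>>(d, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }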
This is all my best guess, so if I got it wrong please let me know :)
kayvonf
@mchoquet: Solid start. I believe everything is correct except:
"At most four warps can be running at once, so a single block can't have more than 4 * 32 = 128 threads in it"
"A shared cache, which may or may not be split into <= 16 thread block contexts (some clarification on the splitting would be appreciated)."
Take a look at my long post on slide 52, then try and make an update to your post, and then you'll have it down.
kkz
Is there a reason the FLOPS value is twice 8 x 192 x 1 GHz ~ 1.5 TFLOPS?
kayvonf
@kkz: a multiply-add is counted as two floating-point operations, so it's 1.5 tera-madds = 3 TFLOPS.
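Spelling that arithmetic out (using the 8 SMX x 192 ALUs x ~1 GHz figures from @kkz's question):

    8 x 192 x 1.0 GHz             ~ 1536 G madds/sec
    1536 G madds/sec x 2 FLOPs    ~ 3 TFLOPS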
kkz
@kayvonf ohhh ok, didn't know that haha