toutou

Should the size of a block always be a multiple of 32?

Kapteyn

You don't have to make it a multiple of 32; it's just more efficient to do so. I'm guessing that if you had, say, 33 threads per block, the first 32 threads would run on one row of 32 SIMD execution units, and the 33rd thread in the block would take up an entire 32-wide SIMD vector with 31 out of the 32 lanes masked off.
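
To make that concrete, here's a tiny sketch (my own illustration, not from the lecture) of the arithmetic: the hardware schedules whole 32-wide warps, so a block's thread count is effectively rounded up to a multiple of 32, and any leftover lanes in the last warp are masked off.

```cpp
#include <cstdio>

int main() {
    const int warpSize = 32;

    // A few hypothetical block sizes to compare.
    int blockSizes[] = {32, 33, 64, 100};

    for (int threadsPerBlock : blockSizes) {
        // Whole warps are allocated, so round the thread count up.
        int warpsPerBlock  = (threadsPerBlock + warpSize - 1) / warpSize;
        int allocatedLanes = warpsPerBlock * warpSize;
        int maskedLanes    = allocatedLanes - threadsPerBlock;

        printf("block of %3d threads -> %d warp(s), %2d of %3d lanes masked off\n",
               threadsPerBlock, warpsPerBlock, maskedLanes, allocatedLanes);
    }
    return 0;
}
```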

Kapteyn

It seems to me that one NVIDIA GTX 680 SMX unit should be able to have a maximum of 64 warps * 32 threads per warp = 2048 CUDA threads per block. But documentation online says:

The NVIDIA GTX 680 can maximally handle 1024 threads per block and 2048 concurrent threads per multiprocessor.

Why is this?

xwenwenwen

A simple/naive question: at the end of lecture, Kayvon described the situation where the number of threads a warp can hold is less than the number of threads in a block. In that case, after a warp's worth of threads executes, will the remaining threads in the same block continue to execute immediately on the SAME warp?

andymochi

@Kapteyn I think this is a good question. I tried to reason it out, but I'm having the same trouble.

I see the number you are referring to on the wiki entry. The GTX 680 falls under compute capability 3.0, which says "Maximum number of resident threads per multiprocessor: 2048". This number, I think, refers to 8 32-wide SIMD ALUs (2 of which are the SFU and LD/ST units) across all 8 of the SMX units. That's how we get 2048 = 8 * 32 * 8.

Here's where my logic gets fuzzy. It also says "Maximum number of resident warps per multiprocessor: 64". I'm not sure if this means that the entire GPU could only have 64 warps, or if each SMX core can have 64 warps (not clear from the slides either).

Either way, it seems the constraint is: # threads per block <= # warps per SMX unit * warp width. Otherwise, we might face concurrency issues as exemplified on the last slide.

To be honest, I think the number we're more interested in right now is actually the 1536 CUDA cores on the whole GPU, which I got from page 6 of the NVIDIA spec sheet. You can get this number if you exclude the SFU and LD/ST ALU sets on each SMX core (6 * 32 * 8).
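
If it helps keep the two "magic numbers" straight, here's a tiny sketch (my own arithmetic, not from the slides) separating the per-SMX execution-context limit from the chip-wide core count:

```cpp
#include <cstdio>

int main() {
    // Published GTX 680 (Kepler, compute capability 3.0) numbers.
    const int warpSize       = 32;   // threads per warp
    const int maxWarpsPerSMX = 64;   // resident warps per SMX
    const int coresPerSMX    = 192;  // CUDA cores (SP ALUs) per SMX
    const int smxCount       = 8;    // SMX units on the chip

    // Resident (concurrent) threads per SMX: an execution-context limit.
    printf("resident threads per SMX: %d\n", maxWarpsPerSMX * warpSize); // 2048

    // CUDA cores on the whole GPU: a count of execution resources, a different thing.
    printf("CUDA cores on the chip:   %d\n", coresPerSMX * smxCount);    // 1536
    return 0;
}
```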

doodooloo

So if the block size is 64 (threads), are we able to run two blocks on the same core?

kayvonf

@Kapteyn. Your calculation properly computed the maximum number of threads per SMX core = 2048 = 64 warps x 32 threads per warp.

The maximum number of threads per CUDA thread block is a hardware limitation, but it is unrelated to anything fundamental discussed in this class. (Your understanding of what you need to know is perfectly correct.)

The GTX 680 SMX core is able to concurrently run up to two blocks of 1024 CUDA threads each.
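
For anyone who wants to check these limits on their own machine, here is a minimal sketch (my own, not from the lecture) that queries them through the CUDA runtime; maxThreadsPerBlock and maxThreadsPerMultiProcessor are standard fields of cudaDeviceProp.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:                      %d\n", prop.warpSize);
    printf("Multiprocessor count:           %d\n", prop.multiProcessorCount);

    // On a GTX 680 (compute capability 3.0) this reports 1024 threads per block
    // and 2048 threads per multiprocessor, i.e. one SMX can hold two 1024-thread blocks.
    return 0;
}
```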

kayvonf

Note to everyone: This slide was a thought experiment where we assumed a simpler SMX core with support for only four warps of concurrent execution. A real SMX core can support up to 64 warps (2048 CUDA threads) of concurrent execution.

Please see slide 42 and slide 47 for nice discussions of the actual capabilities of the NVIDIA GTX 680 GPU.