Slide 52 of 81
axiao

This CUDA program won't create 1 million instances of per-thread local variables or stacks. The GPU's thread block scheduler assigns thread blocks to cores based on their resource requirements. Until a thread block is scheduled onto a core, no storage is allocated for that block's variables or for its threads. Before being scheduled, the only space a thread block occupies is the metadata the block scheduler needs to track it, plus the (shared) kernel instructions.
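To make this concrete, here is a minimal sketch (the kernel, array, and launch parameters are hypothetical, not from the slide): a launch of roughly 1 million threads is expressed as a grid of 4096 blocks of 256 threads, and the hardware only allocates registers and local storage for the blocks currently resident on a core.

```cuda
#include <cstdio>

// Hypothetical kernel: each thread has its own local variable x, but
// storage for x exists only while the thread's block is resident on a
// core -- never for all ~1 million threads at once.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i] * 2.0f;   // lives in a register on the core
        data[i] = x;
    }
}

int main() {
    const int N = 1 << 20;                 // ~1 million elements/threads
    const int threadsPerBlock = 256;
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // 4096

    float *d;
    cudaMalloc(&d, N * sizeof(float));

    // The block scheduler drains these 4096 blocks onto the cores a few
    // at a time, as register and shared-memory resources free up.
    scale<<<blocks, threadsPerBlock>>>(d, N);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```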

rav

No, the CUDA program won't create 1 million instances. Execution happens dynamically at the block/warp level: warps of 32 threads are processed simultaneously, and at any given time only the state for the blocks whose warps are currently resident on a core needs to be stored.

Tiresias

When "creating a million threads" (i.e., dynamically assigning work to blocks until the work of 1 million threads is done), is the work assignment first come, first served (assign the next thread block to the first core with available resources), or is there a more complex form of scheduling involved?

Also, would intelligent scheduling even help with this problem?