Slide 58 of 81
bochet

Is there any limit on how large a thread block can be?

In the GTX 1080, one SM core can handle up to 64 warps * 32 = 2048 threads. If there is a shared memory allocation for each thread block, and the thread block contains more than 2048 threads, how can the GPU guarantee that each thread within the block has access to the same shared memory storage?

firebb

If GPUs are designed to execute a single instruction on multiple data, why do we need two fetch/decode units for a single warp?

What does "Select up to two runnable instructions per warp" mean? Does it means a single warp could have 2 execution context for two instruction flow or 2 instructions in a single instruction flow?

bochet

@firebb The up to two instructions is because of instruction-level parallelism; it's within a single instruction flow.

unparalleled

I want to clarify one basic question. The number of threads per warp is determined by the width of the SIMD unit. Since we have 32-wide SIMD, each gang will process 32 instances at a point in time. Is my understanding correct here?

paracon

@unparalleled, yes. But there is instruction-level parallelism as well, so another (not necessarily SIMD) instruction could also be executed along with it.

Question: The diagram implies that the number of warp selectors depends on the number of cores. If the number of threads per block were set to more than 128, how would the SM core function? (Do the warp selectors choose which warps to execute?)

kayvonf

@unparalleled. This lecture is talking about CUDA code, so there is no concept of a gang in CUDA (that is an ISPC concept). There are only CUDA threads, organized into CUDA thread blocks. The number of threads per warp is an implementation detail of the GPU hardware. And yes, NVIDIA GPUs share an instruction stream across 32 CUDA threads. You can think of a warp as the hardware execution context corresponding to 32 CUDA threads that share an instruction stream.
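
If it helps to see that mapping in code, here is a minimal sketch (the kernel name and the data it writes are made up purely for illustration). CUDA exposes the warp width as the built-in warpSize, which is 32 on current NVIDIA hardware:

    __global__ void warp_demo(float* data) {
        // Linear index of this CUDA thread within its thread block
        int t = threadIdx.x;

        // Which warp this CUDA thread belongs to, and its lane within that warp.
        // All threads with the same warp_id share one instruction stream.
        int warp_id = t / warpSize;
        int lane_id = t % warpSize;

        data[blockIdx.x * blockDim.x + t] = warp_id + 0.01f * lane_id;
    }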

kayvonf

@paracon. This slide shows only one core (one "SM" in NVIDIA-speak). There is no correspondence between the number of warp selectors and the number of cores. Each clock, four of the up-to-64 active warps are chosen for execution. (That's the role of the warp selector.) Instructions from these four warps are executed on the core. If this core was tasked to run a CUDA thread block with 1024 CUDA threads, then that thread block would require 32 warps (32x32=1024). Each clock the core would choose four of those 32 warps to execute instructions from.

In the microarchitecture of this GPU, there's a correspondence between a warp selector and the hardware units that run the instruction(s) coming from that specific warp. You can see this in the visualization, as each warp selector (and its corresponding instruction decode units) are located right next to a set of SIMD ALUs they control.
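
As a side note, since the question at the top of the thread asked about limits on block size: those limits are fixed properties of the hardware and can be queried at runtime. A minimal sketch, with error checking omitted:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0

        printf("warp size:               %d\n", prop.warpSize);
        printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
        printf("shared memory per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
        return 0;
    }

On this generation of hardware the reported block limit is 1024 threads, so a single block always fits within one SM, and all of its threads therefore see the same shared memory storage.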

kayvonf

Question: Explain the statement: The GTX 1080 SM combines the ideas of interleaved multi-threading, simultaneous multi-threading, SIMD execution, and super-scalar execution.

paracon

GTX 1080 SM combines the ideas of interleaved multi-threading, simultaneous multi-threading, SIMD execution, and super-scalar execution in the following manner:

  1. Interleaved multi-threading: Since up to 64 warps can be active at any time, execution of multiple threads is interleaved depending on which 4 warps are chosen each clock.
  2. Simultaneous multi-threading: Each SM has four warp selectors, and each warp selector can issue instructions from a different thread, so multiple threads execute simultaneously.
  3. SIMD execution: There are SIMD functional units in each SM that perform the SIMD operations.
  4. Super-scalar execution: During each cycle, a warp selector can issue up to 2 instructions, supporting super-scalar execution.

I have based my answers on threads, not CUDA threads. Is that the right way to approach it?

kayvonf

@paracon: This is a very nice answer. And you are correct to talk about warps (not CUDA threads), since we are describing the behavior of the GPU core.

However, you would have also been correct to take a CUDA thread-centric answer to my question:

  • Up to 64*32=2048 CUDA threads can be active on the SM at a time
  • But only 4*32=128 CUDA threads are executed simultaneously.
  • Groups of 32 CUDA threads share an instruction stream. (and are executed in a SIMD manner)
  • The SM may choose to execute 2 instructions from a CUDA thread at once when an appropriate mixture is available. (And of course each of those two instructions are executed in a SIMD manner as described above.)
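
Putting rough numbers on all four mechanisms using the slide's figures: 64 warps * 32 = 2048 active CUDA threads (interleaved multi-threading); each clock, 4 warps are selected (simultaneous multi-threading) and up to 2 instructions are issued per selected warp (super-scalar), for up to 4*2 = 8 instruction issues per clock; and each issued instruction executes across 32 SIMD lanes.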

rc0303

I'm curious about the logic contained in the warp selectors/schedulers. Do they select warps based on recently executed instructions, or just availability of active threads? After searching for a few minutes, I haven't found any more details from Nvidia directly.

sadkins

I am a bit confused on what exactly a "warp" is. What is the difference between a warp and a block of CUDA threads?

jkorn

@sadkins slide 60 might be useful. A warp is a group of 32 CUDA threads that share an instruction stream (they are scheduled together and map to a single execution unit). A block can be larger than 32 threads, but each block is still split up into these 32-thread warps, and all the threads in the block still share thread-block properties such as shared variables. A rough sketch of the split follows below.
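
The kernel and launch names here are made up and untested; a block of 128 CUDA threads ends up as 128/32 = 4 warps, but all 128 threads share the block's shared memory:

    __global__ void block_vs_warp(const float* in, float* out) {
        __shared__ float buf[128];            // one copy per thread block, visible to all 4 warps

        int t = threadIdx.x;                  // 0..127 within this block
        int i = blockIdx.x * blockDim.x + t;  // global index

        buf[t] = in[i];                       // each warp issues this as one 32-wide SIMD load
        __syncthreads();                      // barrier across every warp in the block

        out[i] = buf[127 - t];                // safe: the whole block has filled buf
    }

    // launch with 128 threads per block, e.g. block_vs_warp<<<N / 128, 128>>>(d_in, d_out);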

-o4

Correct me if this is wrong. For this specific GPU we have: 2-wide ILP, 4-way SMT (simultaneous multi-threading), 4 simultaneously executing warps, and 64 interleaved warps.

machine6

In addition to what @jkorn said, a warp is a concept that pertains to the GPU hardware; the CUDA software abstraction has no notion of a 'warp'; it only knows what a block is, what a CUDA thread is, and so on.

rjvani

I'm a little confused on how instruction streams are loaded for warps. Where are these instructions actually stored?

kayvonf

@rjvani. I'll start with a question. Where are instructions actually stored in a regular CPU?

rjvani

@kayvonf The instructions are stored in memory. I'm just confused on how execution switches between CUDA threads. Is there no explicit OS context switch required, since the group of CUDA threads in a warp corresponds to one execution unit?

kayvonf

@rjvani. Correct about storing in memory. Now let me ask another one to you... On a multi-threaded CPU (such as an Intel CPU with Hyper-threading), there is no OS involvement in the CPU switching between two hardware threads, right?

rjvani

@kayvonf Correct, it doesn't involve the OS. (This cleared up a lot.)

SR_94

So the GPU is responsible for switching between different warps and loading their instruction streams?

ShadoWalkeR

@SR_94, the GPU is responsible for scheduling a warp onto the GPU's processing units. But my understanding is that there is no context switching on the GPU. Once scheduled, a warp has to complete its execution before another can take its place. This can give rise to deadlocks if threads in the same block are synchronized incorrectly.
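
As a contrived sketch of "synchronized incorrectly": if only part of a block reaches a barrier, the block can hang, because the barrier waits for threads that never arrive (strictly speaking this is undefined behavior):

    __global__ void bad_barrier(float* data) {
        int t = threadIdx.x;
        if (t < 64) {
            data[t] *= 2.0f;
            __syncthreads();  // only threads 0..63 reach this barrier; the rest of
                              // the block never does, so the block can hang here
        }
    }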

paracon

@SR_94, @ShadoWalkeR, the GPU does not perform context switching, but it does select which 4 warps among the active ones it can schedule. And if a warp has to do a memory access (a high-latency operation), the GPU would want to schedule another warp in its place. The correctness of the synchronization logic is the responsibility of the programmer, I believe.

yes

That seems to be the case, @paracon, given slide 76: thread blocks can be scheduled in any order, and the system assumes no dependencies between blocks.

kayvonf

@paracon, @ShadoWalkeR. There is potential confusion here because of the use of the term context switching.

I'd say the GPU absolutely performs context switching because it performs interleaved multi-threading. It maintains multiple execution contexts (warps) and chooses which contexts to draw instructions from each clock. From the perspective of the ALUs, the context being executed is changing at per-instruction granularity.

However, what is not happening is software (the GPU driver part of the OS) trapping the GPU, preempting running threads, and then replacing them with new threads. In your posts, you are using the term "context switching" to mean OS-managed interleaving of software threads (or processes) onto hardware execution contexts. To a hardware architect, context switching can refer to interleaved multi-threading.

Modern GPUs, to support features like virtualization and CUDA dynamic parallelism, are beginning to support limited forms of software-managed thread pre-emption, but we will not talk about it in this class.

ShadoWalkeR

@kayvonf, thank you for the clarification, Prof. Yes, I was using 'context switching' to mean OS-controlled interleaving of different threads.