Slide 46 of 54
cgjdemo

Is the warp the smallest scheduling unit of NVIDIA GPUs? Is there some kind of scheduling mechanism among the threads inside a warp?

kayvonf

@cgjdemo: you are correct. The warp is the minimum unit of scheduling in current NVIDIA GPU implementations. (AMD GPUs use the term "wavefront" to refer to the same concept.)

Question: How would you compare an ISPC gang and a CUDA thread block?

cgjdemo

ISPC program instances are similar to CUDA threads, since both run concurrently as a group. So I think an ISPC gang is similar to a CUDA thread block.
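
Here's a minimal sketch of how I picture the analogy (the kernel and names are just made up for illustration): in ISPC each program instance sees programIndex/programCount inside its gang, and in CUDA each thread sees threadIdx/blockDim inside its block.

    // Hypothetical CUDA kernel: each CUDA thread plays the role of one ISPC
    // program instance, and the thread block plays the role of the gang.
    __global__ void scale(float* x, float a, int N) {
        // threadIdx.x is like programIndex, blockDim.x is like programCount
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            x[i] *= a;
    }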

kayvonf

@cgjdemo: Excellent! And a more subtle question: what does a warp correspond to?

cgjdemo

A warp corresponds to a group of 32 CUDA threads inside a thread block?

afzhang

@cgjdemo: Yes! From my notes from class on Monday: "warp - 32 CUDA threads in a block together in SIMD, a.k.a. an execution context"
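
A tiny sketch of that grouping (assuming 1D thread indexing and a single block for simplicity; the kernel and names are mine): consecutive threads in a block are bundled into warps 32 at a time.

    // Hypothetical kernel: record which warp and which lane each thread lands in.
    __global__ void whichWarp(int* warpOf, int* laneOf) {
        int tid = threadIdx.x;       // thread index within its block
        warpOf[tid] = tid / 32;      // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
        laneOf[tid] = tid % 32;      // position ("lane") within the warp
    }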

HLAHat

Why have warps at all, then? If I understand correctly, a warp is executed all at once, with each thread in it running the same instruction. Why pre-define the size of the warp? Wouldn't it be better to let the programmer specify it along with block size, etc.? I wonder if it's just 32 so that there's a higher guarantee that the execution units are saturated more often. Since GPUs should really only be used for math-intensive work, is it just easier to have the programmer toss a bunch of threads at the hardware?
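
(For what it's worth, the warp size isn't something the programmer picks; it's fixed by the hardware and just exposed to the program. A quick sketch, assuming device 0, of how you could check it:)

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);           // query device 0
        printf("warp size = %d\n", prop.warpSize);   // 32 on current NVIDIA GPUs
        return 0;
    }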

JuneBot

@kayvonf: If I'm interpreting it right, a warp is an implementation detail that isn't part of the CUDA abstraction, which only exposes many CUDA thread blocks. If CUDA thread blocks are analogous to ISPC gangs, would the closest analogue of a warp in ISPC be the concept of a task? And would having multiple warps on different cores be roughly equivalent to calling launch[n]? I think this fits with the shared-instruction-stream aspect of warps, at least.

But now I'm a bit stuck, because the last sentence in the first paragraph here seems to equate gangs with warps. On the other hand, this comparison makes it sound like ISPC needs to build in a second layer of grouping via "tasking granularity" to interface with the CUDA model, so maybe there isn't a really good counterpart for warps in ISPC: ISPC has only a single layer of grouping of computations on one core, while CUDA has two.
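
To make the two CUDA levels concrete, here's a tiny sketch (my own toy kernel and launch sizes): the launch names both the number of blocks, which get distributed across the GPU's cores, and the threads per block, which run on one core as warps of 32.

    #include <cuda_runtime.h>

    __global__ void addOne(float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] += 1.0f;
    }

    int main() {
        float* x;
        cudaMalloc(&x, 128 * 256 * sizeof(float));
        // Two levels of grouping: 128 blocks (spread across SMs, a bit like
        // launched tasks) of 256 threads each (run on one SM as 256/32 = 8 warps).
        addOne<<<128, 256>>>(x);
        cudaDeviceSynchronize();
        cudaFree(x);
        return 0;
    }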

grose

Why have warps at all? Well, a warp has 2 decode units, I believe so that it can take advantage of instruction-level parallelism... Wait, the slide seems to contradict that by saying each warp maps onto just 32 ALUs running that many threads.

jiajunbl

Each warp consists of 32 CUDA threads, which execute the same instruction stream. So if we focus on just one warp, we can process 2 independent instructions, which can take advantage of 2 vectors of ALUs (64 functional units).

Warps are good because they allow the hardware to store multiple execution contexts. So if one warp is waiting on a long-latency operation (e.g., a memory access), we can execute instructions from another warp.
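
A sketch of the kind of kernel where this pays off (a made-up example): each thread's global load can stall its warp for a long time, and while that warp waits, the warp scheduler issues instructions from other warps whose contexts are resident on the core.

    // Memory-bound gather: dst[i] = src[idx[i]]. The dependent loads stall a
    // warp until the data arrives; with many warps resident per SM, the
    // scheduler switches to a ready warp and the memory latency is hidden.
    __global__ void gather(const float* src, const int* idx, float* dst, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            dst[i] = src[idx[i]];
    }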