Slide 56 of 79
Richard

I'm a little confused by the "instruction-level parallelism"...

I see there are two Fetch/Decode modules for each chosen warp, but how do these two instruction streams execute in parallel?

fleventyfive

Why does the slide say "A convolve thread block is executed by 4 warps"? Where does this 4 come from?

Richard

@fleventyfive There are 4 warp selectors, meaning that at any time 4 warps are chosen for execution out of the 64 resident warps. This "4" doesn't come from the program; it comes from the hardware.

fleventyfive

Oh I see... Thanks!

grizt

Apparently, according to Tom's Hardware, the term "warp" comes from weaving, where you have (literally) a set of parallel threads. Can anyone verify if this is in fact correct?

yikesaiting

Warps are a GPU implementation detail, not part of the CUDA abstraction. Important to keep in mind.

TanXiaoFengSheng

Suppose a block is mapped to exactly 4 warps on a core. Do these 4 warps have to be scheduled onto execution units at the same time? If not, how is deadlock avoided in the following scenario: 3 of the 4 warps have been scheduled to execute, but the remaining execution slot is occupied by a warp from another block, and all of them are waiting on some "event". The three warps from the same block are waiting for an event from the unscheduled warp, and the warp currently executing is waiting for an event from its own block. Since the scheduler does not preempt, they would all wait forever.

kayvonf

@grizt. Yes, that was the inspiration for the term warp.

https://en.wikipedia.org/wiki/Warp_(weaving)

kayvonf

@fleventyfive, @Richard. A convolve thread block is executed by 128 CUDA threads (as defined by the program). 32 CUDA threads map to the same warp, meaning that the execution of these threads shares an instruction stream. Therefore, 128 / 32 = 4 warps are required to execute one thread block from the program.
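
To make the arithmetic concrete, here is a minimal sketch of a launch configuration that produces 4 warps per block (this is not the exact lecture code; the kernel signature and constant names are assumptions):

    // Sketch: where "4 warps per block" comes from.
    #define THREADS_PER_BLK 128   // chosen by the program
    #define WARP_SIZE        32   // fixed by the hardware

    __global__ void convolve(int N, const float* input, float* output) {
        // ... kernel body omitted ...
    }

    int main() {
        int N = 1024 * 1024;
        float *input, *output;
        cudaMalloc(&input,  sizeof(float) * (N + 2));
        cudaMalloc(&output, sizeof(float) * N);

        // 128 CUDA threads per block => 128 / 32 = 4 warps per block
        convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, input, output);
        cudaDeviceSynchronize();
        return 0;
    }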

It is true that the GTX 980 SMM can select up to four warps to run per clock. However, that is not what is being referred to in the line "a convolve thread block is executed by four warps."

More detail on warps here.

kayvonf

@Richard. The SMM will select four of the 64 possible warps to run. It can then run up to two instructions from each selected warp, if independent instructions exist in that warp's instruction stream. Therefore, the system is utilizing simultaneous multi-threading to run multiple warps at once, and it is also leveraging instruction-level parallelism within a warp's instruction stream to potentially execute multiple independent instructions from the same warp at once.

CaptainBlueBear

@TanXiaoFengSheng I don't think that scenario is actually possible. From the scheduling example on slides 60 and onwards, it seems block execution on a core isn't interleaved. So in your case, all 4 of the first block's warps would have to complete execution before the other block's warps get executed. Someone correct me if I'm wrong.

kayvonf

@CaptainBlueBear, @TanXiaoFengSheng. Multiple thread blocks can certainly be executed concurrently on an SMM core, provided the resources are available. Note this is exactly the case in the example at the end of this lecture, where a core has sufficient resources to have two thread blocks executing on it at the same time. In that example, the two thread blocks of 128 threads occupy a total of (2*128)/32 = 8 warps, and those warps certainly execute on the core in an interleaved fashion.
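
If you want to check how many thread blocks of a given kernel and block size can be co-resident on one SM, the CUDA runtime can tell you. A small sketch (the convolve kernel here is a stand-in, not the lecture's actual code):

    #include <cstdio>

    __global__ void convolve(int N, const float* input, float* output) {
        // ... body omitted ...
    }

    int main() {
        int blocksPerSM = 0;
        // How many 128-thread blocks of this kernel fit on one SM at once?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, convolve, 128, 0);
        printf("Resident blocks per SM: %d (= %d warps)\n",
               blocksPerSM, blocksPerSM * 128 / 32);
        return 0;
    }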

Let's go back to the description of how the SMM core works on this slide, and review:

Each clock, the GTX 980 SMM core:

  • Selects up to four unique, runnable warps to run instructions from. These four warps can come from any thread block currently active on the core. This is an instance of simultaneous multi-threading (lecture 2).
  • From each of these four warps, the core attempts to find up to two instructions to execute. This is instruction-level parallelism (lecture 1) within the warp. If independent instructions are not present in the warp's instruction stream, then only one instruction from the warp can be executed; in that case there is no ILP in the instruction stream to exploit. (A sketch illustrating this follows the list.)
  • Out of the eight total instructions the core tries to find (across the four warps), up to four of those can be arithmetic instructions. These instructions will be executed on the four different groups of 32-wide SIMD ALUs that the core has. To be absolutely clear, all 4x32 = 128 ALUs in the SMM execute these four 32-wide instructions at the same time -- true parallel execution. (As pointed out in the footnote, and discussed a bit on the previous slide, other non-basic-arithmetic instructions, like loads and stores, or special arithmetic ops like sin/cos/pow, can also be executed simultaneously with these four basic SIMD arithmetic ops, but I didn't draw the units that perform these operations on this diagram.)
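
To make the ILP point above concrete, here is an illustrative kernel fragment (not from the lecture) showing what "independent instructions" within one warp's instruction stream means. Whether dual issue actually happens is up to the hardware scheduler:

    __global__ void ilp_example(float* out, const float* a, const float* b) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // These two multiplies have no data dependence on each other, so the
        // scheduler may issue both from this warp in the same clock (ILP).
        float x = a[i] * 2.0f;
        float y = b[i] * 3.0f;

        // This add depends on both x and y, so it cannot issue until they
        // complete; there is no ILP between it and its inputs.
        out[i] = x + y;
    }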

Note that in the diagram above, the core has the ability to maintain execution contexts for up to 64 warps at once (if you like to think in terms of CUDA threads rather than warps, this is 64x32=2048 CUDA threads). These warps are executed concurrently by the SMM core, but in any one clock the core will only execute instructions from at most four of them. This is the idea of interleaved multi-threading as illustrated here.

Do not confuse the requirement that all CUDA threads (or their corresponding warps) in a thread block must be live -- a.k.a. occupying an execution context on a core -- during the lifetime of the thread block (a requirement that we discuss again on slide 73) with the fact that the core can indeed run instructions from multiple threads simultaneously on its parallel execution units. It seems to me that interleaved multi-threading and simultaneous execution of instructions from multiple threads are being confused in some of the comments posted above.

To check yourself, go back and take a look at the example I give at the end of lecture 2, which puts interleaved multi-threading, instruction-level parallelism, and SIMD all together in a simple example that is very similar to a modern quad-core Intel Core i7 CPU. There are no new concepts on the slide above; they are just pushed to a larger and a bit more exotic scale. And that's the whole point! I don't want anyone thinking of a GPU core as some radical thing. GPU architectures just embody the same basic throughput computing principles that modern CPUs do, but to an extreme degree!

CaptainBlueBear

@Kayvon In reference to "all CUDA threads (or their corresponding warps) in a thread block must be live during the lifetime of the thread block": if we somehow had only 4 warps on the core but 256 threads per block (so we needed 8 warps), would our code fail to compile? Or just return an error code?

kayvonf

@CaptainBlueBear. All CUDA programs are compiled against a particular CUDA Compute Capability, which defines the minimum requirements a CUDA GPU must meet to run the program. A program that required 256 CUDA threads per thread block would not compile if the compiler was told to target GPUs that only supported 128 CUDA threads per SMM.

Wikipedia has a nice listing of which compute capability is supported by which NVIDIA GPUs, and the NVIDIA programming guide has a complete description of what each compute capability means in terms of the number of threads per block, blocks per core, size of shared memory, etc.
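
You can also query these limits at runtime rather than digging through tables. A small sketch (the fields shown are standard cudaDeviceProp members; the printed values of course depend on the GPU):

    #include <cstdio>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }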

Richard

@CaptainBlueBear By saying "if we somehow had only 4 warps on the core", do you mean only 4 warps executed simultaneously (like the architecture in this slide), or only 4 candidate warps (there are 64 candidate warps in this slide)?

Wikipedia gives this spec for the GTX 980: maximum number of threads per block: 1024.

That wouldn't be possible if these 1024 threads had to execute exactly simultaneously... @kayvonf correct me if I'm wrong...

CaptainBlueBear

@Richard, I meant only 4 candidate warps.

As @kayvonf explained (and just to reiterate to make sure I understand), the threads in a block don't all need to be executed simultaneously (there is, in fact, interleaving of different threads), but they do all need to fit on the same core (i.e., on the candidate warps).

So in terms of my question: we'd be fine if there were only 4 warps executed simultaneously, but not if there were only 4 candidate warps (in which case our code wouldn't compile).