lixf

How exactly does this scheduling process work? Is it implemented with only hardware support? Is it correct to think of it as putting a pile of paper on the table, where each person just takes the next piece when they are done (as in the experiment on the first day of class)?

RICEric22

I am confused about warps. To my understanding, warps are groups of CUDA threads, where each warp has a number of CUDA threads equal to the vector width (correct me if I'm wrong). But what determines how many warps are possible? There are 4 warp selectors in this architecture, so why is the maximum number of warp contexts 64?

mchoquet

CUDA thread blocks are assigned to groups of <= 4 warps based on the size of the thread block (warps have a max size of 32 because the SIMD width is 32), and up to 4 warps are running at any given time. Up to 64 saved warp contexts can exist at once to ensure that the GPU is always busy, even if the current warps have to do slow memory operations. See here and here.

-- EDIT --

My bad, CUDA thread blocks can actually be split into as many as 64 warps at once. The idea is that a block should all be able to run on the same chip, and chips have space for 64 warp contexts.

yanzhan2

The reason for 64 warp contexts (2048 threads) in one core is latency hiding. Recall lecture 2, slide 47, where we hid stalls using multithreading. The situation here is similar, but with warps: if one warp is waiting on a memory access, other warps can use the execution resources.

nbatliva

I have a couple of questions about how all of this works.

  1. The point of a warp context is to utilize all of the SIMD lanes by running one CUDA thread on each lane, right? (max 32 threads per warp because there are 32 SIMD lanes)
  2. Does a warp context have a fixed size? If not, is the figure of 64 warp contexts per core a generalization?
  3. What happens if the shared memory for a single block is so large that, when a core attempts to handle multiple blocks, the shared memory does not fit into the L1 cache? The slide says that shared memory is allocated in the L1 at launch. Is the GPU smart enough to only process the single block (and associated threads) whose data it can store in the L1? Or will something else happen, like interleaving processing of another block, so that L1 cache memory is exchanged for the new block?

mpile13

I'm trying to understand the relation between blocks, warps, and cores. Can someone colloquially explain how these three things are related?

My understanding is that each core is broken up into warps and each warp contains a certain number of blocks, but I don't understand why we break things down into each of these parts or what they're used for exactly.

yanzhan2

A warp typically has size 32, and it is executed in lockstep on a core. If you are not concerned with performance, you can ignore warps, because they are an implementation detail and you cannot control them in CUDA. Programmers control how each block and thread behaves; that is the abstraction. Warps are an execution (implementation) detail. Cores (SM or SMX in NVIDIA terminology) execute instructions, and blocks are mapped onto cores. It is not correct to say that each warp contains blocks.

kayvonf

@yanzhan, technically the concept of a warp is exposed in CUDA. For example, CUDA has built-in "warp vote" and warp shuffle instructions (see Appendix B.13 and B.14 of the CUDA Programmer's Guide) that allow the warp implementation detail to bleed into the CUDA language itself. However, other than these intra-warp communication primitives -- which did not exist in the original version of CUDA and were added later because they were easy to support in hardware and useful to programmers wanting maximum performance from GPUs -- it is usually perfectly fine to reason about a CUDA program without thinking about its implementation on an NVIDIA GPU using warps.
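For the curious, here is a minimal sketch of what those primitives look like in use (I'm using the modern *_sync forms of the intrinsics; older CUDA versions spelled them __shfl_down and __ballot, without the mask argument):

```
// Sketch: warp-level reduction using the shuffle and vote intrinsics.
// Launch with a single warp, e.g. warp_sum<<<1, 32>>>(d_in, d_out).
__global__ void warp_sum(const int* in, int* out) {
    int lane = threadIdx.x % 32;          // lane index within the warp
    int val  = in[threadIdx.x];

    // Warp vote: bitmask of lanes whose value is negative.
    unsigned mask = __ballot_sync(0xffffffff, val < 0);

    // Warp shuffle: tree-reduce the 32 lane values without shared memory.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if (lane == 0) {
        out[0] = val;        // lane 0 now holds the warp-wide sum
        out[1] = (int)mask;  // which lanes voted "true"
    }
}
```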

kayvonf

An NVIDIA GPU executes 32 CUDA threads together in SIMD lockstep. That is, each of these 32 CUDA threads shares an instruction stream. This is an NVIDIA GPU implementation detail, and NVIDIA calls these 32 threads that share an instruction stream a warp. (However, since it is common for CUDA programs to make assumptions about the warp size, at this point it would be very hard for NVIDIA to change the warp size without forcing many CUDA applications to be rewritten.)

In practice, 32 consecutive threads in a block are mapped to the same warp. Therefore, a CUDA thread block with 128 threads, such as the convolve example introduced on slide 43, will be executed using four GPU warps. One warp will execute CUDA threads 0-31 of the block, another will execute threads 32-63, and so on. The 256-thread block introduced in the modified convolve program on slide 56 will be executed using eight warps.
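To make the mapping concrete, here is a tiny sketch (the kernel name is made up for illustration) that computes which warp and lane each thread lands in, assuming the usual warp size of 32 and a 1D thread block:

```
#include <cstdio>

// Sketch: how consecutive thread indices map to warps (1D block assumed).
// For a 128-thread block, warp_id takes the values 0..3, matching the
// four warps described above. Device-side printf needs compute capability 2.0+.
__global__ void show_warp_mapping() {
    int tid     = threadIdx.x;
    int warp_id = tid / 32;   // which warp within the block
    int lane_id = tid % 32;   // position within that warp (0..31)

    // One thread per warp reports its warp's thread range.
    if (lane_id == 0)
        printf("block %d: threads %d-%d form warp %d\n",
               blockIdx.x, warp_id * 32, warp_id * 32 + 31, warp_id);
}
```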

It is very reasonable to think of a warp as "an execution context". When executing a warp, the GPU will fetch and decode the next instruction needed by the warp and then execute it on 32 SIMD execution units to carry out that instruction in parallel for the 32 CUDA threads mapping to the warp.

The Kepler GPU shown in the figure has support for up to 64 warp execution contexts. Since 32 threads map to a warp, that's support for up to 64 x 32 = 2048 CUDA threads. However, there are two reasons why the GPU core may run fewer than 64 warps at once.

  1. Notice the core has a total of 256 KB of storage for the local registers of these warps. So 256 KB / 2048 = 128 bytes (32 32-bit values) of register space per CUDA thread. If a CUDA program requires more than 128 bytes of register space per thread (as determined during compilation), the core will only be able to run a smaller number of warps at once. A larger register footprint means fewer active warps; fewer active warps means less concurrency, and therefore less latency hiding is possible by interleaving warp execution.
  2. The second reason the core may run fewer than 64 warps of work at once is insufficient shared memory. Each CUDA thread block may require its own allocation of shared memory. For example, recall that the 128-thread convolve thread block required 130 floats of shared memory, or 130 x 4 = 520 bytes. In this case, the block does not require much storage, and thus shared memory does not limit the total number of blocks that can fit on the core. However, if a CUDA program requires a lot of per-block storage, that resource requirement can limit the total number of blocks that can run on the core at once. For example, if the shared memory requirements of the convolve program were large enough that only two blocks could fit on the core at once, then only 2 x 128 = 256 threads (8 warps) could be active on the core at once. (A back-of-the-envelope sketch of this arithmetic follows this list.)
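Here is that back-of-the-envelope sketch. The 256 KB register file and 64-warp limit come from the slide; the kernel's register and shared-memory footprints, and the 48 KB shared-memory figure, are made-up example values (real occupancy also depends on allocation granularity, which this ignores):

```
// Illustrative occupancy estimate, not a substitute for the CUDA occupancy API.
#include <cstdio>

int main() {
    const int regfile_bytes     = 256 * 1024;  // 256 KB of registers per core (slide)
    const int smem_bytes        = 48 * 1024;   // shared memory per core (example value)
    const int max_warps         = 64;          // warp contexts per core (slide)
    const int threads_per_warp  = 32;

    // Hypothetical kernel footprint:
    const int regs_per_thread   = 64;          // 64 x 4-byte registers = 256 bytes/thread
    const int smem_per_block    = 520;         // convolve: 130 floats
    const int threads_per_block = 128;

    // Limit 1: register file. More registers per thread -> fewer warps fit.
    int warps_by_regs = regfile_bytes / (regs_per_thread * 4 * threads_per_warp);

    // Limit 2: shared memory. Larger per-block allocations -> fewer blocks fit.
    int blocks_by_smem = smem_bytes / smem_per_block;
    int warps_by_smem  = blocks_by_smem * (threads_per_block / threads_per_warp);

    int active = max_warps;
    if (warps_by_regs < active) active = warps_by_regs;
    if (warps_by_smem < active) active = warps_by_smem;

    printf("active warps: %d of %d\n", active, max_warps);  // prints 32 of 64 here
    return 0;
}
```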

Does that make sense?

squidrice

Now I have a clear understanding of warps and threads, but I'm still thinking about the relation between blocks and threads. The maximum number of threads per block in CUDA is 128. Is this number arbitrary, or does it have something to do with the hardware implementation, for example, memory on the chip?

yixinluo

@squidrice The maximum number of threads per block in CUDA is 512. It is limited by the size of the context store (the warp execution contexts on the above slide) on each GPU core.

rokhinip

Also, from my understanding, a block of threads is partitioned into up to 64 warps. Since each warp is executed in SIMD lockstep, synchronization within a warp comes for free. Therefore, __syncthreads is really used to synchronize across the various warps used to execute the block's CUDA threads?

kayvonf

@everyone. Clearing up some confusion... The maximum number of threads per block depends on the CUDA compute capability. That number is 1024 for compute capability 2.0 or higher, which includes the GPUs in the lab.

Since a warp consists of 32 CUDA threads, a thread block's execution may be carried out by at most 32 warps.

@rokhinip is correct in pointing out that __syncthreads is a mechanism to synchronize the threads in a block, across warps.
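To illustrate that point, here is a sketch in the spirit of the convolve example (names and details are illustrative, not the exact assignment code): threads from different warps cooperatively fill a shared array, and __syncthreads() guarantees that every warp in the block sees the completed array before any warp reads it.

```
#define THREADS_PER_BLK 128

// Sketch: __syncthreads() synchronizes all warps of a block, not just the
// lanes of one warp. Bounds checks against N are omitted for brevity.
__global__ void convolve_like(int N, const float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];   // 130 floats of shared memory
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads in *different* warps write different parts of 'support'.
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2)
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    // Without this barrier, a warp could read elements another warp has not
    // written yet. Lockstep execution only holds *within* a warp.
    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.0f;
}
```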

Please see http://en.wikipedia.org/wiki/CUDA for a nice table.
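You can also query these limits at runtime instead of looking them up in the table; a minimal sketch using cudaGetDeviceProperties from the CUDA runtime API:

```
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the per-device limits discussed above for device 0.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("compute capability : %d.%d\n", prop.major, prop.minor);
    printf("warp size          : %d\n", prop.warpSize);
    printf("max threads/block  : %d\n", prop.maxThreadsPerBlock);
    printf("max threads/SM     : %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```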

Yuyang

Sorry I am a little confused by this part of clarification:

"Since a warp consists of 32 CUDA threads, a thread block's execution may be carried out by at most 32 warps."

So this is saying that, since each block has at most 1024 threads, and the threads in a warp always come from the same block (it would be wasteful to take only, say, 14 threads from one block when that block still has at least 32 threads left to execute), a block can be carried out by at most 1024/32 = 32 warps. Am I understanding this correctly?

Also, a separate question: does the assignment of threads to a warp persist until all the computation in those threads is done? I want to say yes, but I am a little unsure. Thanks!!! :D

arjunh

@Yuyang:

1) I would say yes, although @kayvonf could clarify further.

2) I believe so; the threads in a warp cannot move on to a different set of data until all of the threads have completed. This makes sense, as the warp structure provides a SIMD model of execution: the threads in a single warp execute in SIMD style and cannot progress until they have all finished executing the same set of instructions.