ardras

Around this point in the lecture, Professor Kayvon asked whether a block consisting of 256 CUDA threads (8 warps) can run on a core that only has execution contexts for 128 threads (4 warps). Since threads within a block can share data and synchronize with each other (see the sketch after the questions below), the answer was no. I have 3 follow-up questions:

  1. So does the compilation of such a program fail?

  2. If the answer to 1 is yes, should we always make sure that the number of threads per block is less than 32 * (warps per core)?

  3. If the answer to 2 is yes, do we have to keep changing our code for every hardware?
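
To make the block-level sharing and synchronization concrete, here is a minimal sketch (a hypothetical kernel, not from the lecture) of why all 256 threads of a block need live execution contexts on the same core: shared memory and __syncthreads() only work if every thread in the block is resident.

```
// Hypothetical kernel (names are mine): every thread in a 256-thread
// block cooperates through shared memory, so all 256 execution contexts
// must be resident on the same core; the block cannot be split across
// cores or run "half now, half later".
__global__ void blockReverse(const float* in, float* out) {
    __shared__ float buf[256];                  // visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];                   // each thread fills one slot

    __syncthreads();                            // barrier across ALL 256 threads

    // read a value written by a *different* thread in the same block
    out[i] = buf[blockDim.x - 1 - threadIdx.x];
}

// Launched with 256 threads (8 warps) per block:
//   blockReverse<<<numBlocks, 256>>>(d_in, d_out);
```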

dyzz

A follow-up question: is there no way the compiler/processor could be "tricky" and still complete the computation, just with significant overhead? Or would the penalty be too great?

I am thinking of something like doing part of the computation, storing the execution contexts somewhere else, and then switching them back in later. Kind of like application-level threading(?).

kayvonf

The CUDA language provides different "CUDA Compute Capability" profiles that you can compile a program against. Profiles guarantee a certain number of resources (e.g., max registers per CUDA thread, or max CUDA threads per thread block). Different NVIDIA GPUs support different CUDA device profiles.

The GTX 1080 GPUs in the lab support Compute Capability 6.1. You can look up the Compute Capability supported by various GPUs here:

The CUDA Programmer's Guide, Section G defines the various compute capabilities. You can find the same information about capabilities on Wikipedia: https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
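
Regarding question 3 above (changing code for every piece of hardware): rather than hard-coding limits, a program can query them at runtime through the CUDA runtime API. A minimal sketch; the printed labels and the chosen block size are just illustration:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Warp size:          %d\n", prop.warpSize);
    printf("Max threads/block:  %d\n", prop.maxThreadsPerBlock);
    printf("Max threads/SM:     %d\n", prop.maxThreadsPerMultiProcessor);

    // e.g., pick a block size the device is guaranteed to support
    int threadsPerBlock = (prop.maxThreadsPerBlock >= 256) ? 256
                                                           : prop.maxThreadsPerBlock;
    printf("Chosen block size:  %d\n", threadsPerBlock);
    return 0;
}
```

At compile time, nvcc's -arch flag (e.g., -arch=sm_61 for Compute Capability 6.1) selects the compute capability the code is built against.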

Master

My understanding of a warp:

A warp is like a hardware thread in the CPU world. A warp corresponds to a set of 32 CUDA threads that share an instruction stream. It is reasonable to think of a warp as a traditional x86 thread executing a SIMD instruction stream, or as a gang of program instances in ISPC.

paracon

About comparing a gang to a warp: isn't that only true if the gang size is exactly equal to the SIMD width? If the gang size were a multiple of the SIMD width, would the analogy still hold?

kayvonf

@paracon. Good comment. This is a small distinction. However, in practice, even if the gang size is larger than the SIMD instruction width, ISPC still keeps all instances in the gang in lockstep. So you can think of all instances in a gang as being implemented by a logical SIMD instruction of width gang-size. That logical SIMD instruction might be implemented by two actual SIMD machine instructions.

rrp123

Do warps necessarily have to be executed with 32-wide SIMD capability, or could a warp be executed on just a single ALU? If it can be a single ALU, then we wouldn't be able to liken it to ISPC, right?

Thanks!

blah329

A warp is essentially the execution context of a particular set of CUDA threads that belong to the same block; NVIDIA currently implements warps as groups of 32 CUDA threads. Warps exploit data parallelism through SIMD execution: all 32 CUDA threads in a warp execute the same instruction on different pieces of data. This is reminiscent of how program instances within a gang work in ISPC, with each program instance controlling one vector lane. The hardware can also exploit ILP within a warp when consecutive instructions in the warp's instruction stream are independent.

While executing a warp might seem ideal since 32 CUDA threads run in parallel, the speedup can be much smaller under divergent execution, where threads in the same warp take different branches.

One more detail: on this core only one warp can issue instructions at a time, but CUDA only stipulates that when a thread block runs, all of its threads run concurrently. So the warps that make up the block need not run in parallel (which would not be possible on this core anyway, since there are only 32 SIMD units); as long as they are interleaved, that is fine.
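
A small, hypothetical kernel (my own example) illustrating the divergence point: when threads in the same warp take different branches, the hardware runs the branches one after the other with some lanes masked off, so the 32-wide SIMD execution is only partially utilized.

```
// Hypothetical kernel showing intra-warp divergence: odd and even lanes
// of each warp take different branches, so the warp executes both paths
// with half of its 32 lanes masked off on each one.
__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        data[i] = data[i] * 2.0f;   // runs with odd lanes masked off
    } else {
        data[i] = data[i] + 1.0f;   // runs with even lanes masked off
    }
}
```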

hpark914

Each warp has 32 CUDA threads, and the 32 threads in a warp must be contiguous: warp 0 has threads 0-31, warp 1 has threads 32-63, and so on. The number of warp execution contexts on a core is the maximum number of CUDA threads the core can hold divided by 32.
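
A sketch of that mapping inside a kernel, assuming a 1D thread block (the kernel name is mine):

```
#include <cstdio>

// In a 1D block, consecutive values of threadIdx.x are packed into warps:
// threads 0-31 form warp 0, threads 32-63 form warp 1, and so on.
__global__ void whoAmI() {
    int warpId = threadIdx.x / warpSize;   // warpSize is 32 on current NVIDIA GPUs
    int laneId = threadIdx.x % warpSize;   // position within the warp
    printf("thread %d -> warp %d, lane %d\n", threadIdx.x, warpId, laneId);
}
```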