lixf

I'm a little confused by the "how many of what do we have" calculation.

  1. Does each block (abstraction) correspond to a warp (implementation)? Or does one block correspond to multiple warps if we have more than 32 threads in one block?

  2. Since we can only run 6*32 threads at the same time on one core (limited by the number of SIMD units), why do we ever need 64 warps, each with 32 threads?

  3. Also, why are there 8 fetch/decode units? Does each one of them serve one warp?

I feel like there's a certain degree of hardware modularity that separates the number of threads, the number of decoders, the number of warps, etc. Can someone explain how these work together, please?

mchoquet

See my attempted explanation here for my thoughts about how blocks/warps work.

As for (2): the point of having more warps is (I think) to ensure that something is always running. If a warp is currently stalled waiting to read/write memory, then at least 1/4 of the chip's computation power goes unused until it's done. Having other warps ready to step in and do some work until the first warp returns from its slow operation means that more work gets done overall. See here for the relevant slide.

As for (3): my understanding is that two fetch/decode units serve each of the 4 warps that are currently running. This way, all four warps can exploit 2-way ILP at the same time.
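
To make the latency-hiding point in (2) concrete, here's a minimal sketch (the kernel and sizes are mine, not from the slide): a memory-bound kernel launched with many more warps than the core can execute at once, so the scheduler always has runnable warps to swap in while others wait on memory.

    // Minimal sketch (hypothetical kernel): y = a*x + y, one element per thread.
    // The loads of x[i] and y[i] stall the issuing warp for a long time; because
    // far more warps are resident than can execute in any one clock, the core
    // switches to other warps while the stalled ones wait.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch: 256 threads/block = 8 warps per block, and many blocks
    // per core, even though only a handful of warps actually issue per clock.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);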

cwswanso

So in lecture we pointed out that blocks are independent, whereas threads within a block might need to communicate (as with a __syncthreads() call). Why exactly did we point that out? I know it led to some sort of deadlock situation, but I'm not quite sure how. Was it from having too many warps trying to run, or from having one warp execute threads from multiple blocks?

yanzhan2

The professor assumed for the example that each block contains 256 threads, but the core can only execute 4 warps = 128 threads at a time. So the question is: should we finish those 128 threads first and then move on to the remaining 128? The answer is no, because the first 128 cannot finish: all 256 threads in the block need to reach the __syncthreads() barrier, which is impossible if we insist on running the first 128 to completion before starting the rest.

Also, I believe all the threads in a warp come from the same block.
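
Here's a small sketch of that scenario (a hypothetical 256-thread-per-block kernel, not the slide's code) showing why all 8 warps of a block must stay resident and be interleaved rather than running 128 threads to completion first:

    // Hypothetical kernel with 256 threads per block. Every thread must reach
    // __syncthreads() before any thread can move past it, so the hardware cannot
    // run the first 4 warps (128 threads) to completion and only then start the
    // other 4 -- the first half would wait at the barrier forever.
    __global__ void blur1d(const float* in, float* out) {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];   // every thread writes its element
        __syncthreads();             // barrier: all 256 threads must arrive

        // now safe to read values written by threads in *other* warps of the block
        float left  = (threadIdx.x > 0)   ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right = (threadIdx.x < 255) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        out[i] = 0.25f * left + 0.5f * tile[threadIdx.x] + 0.25f * right;
    }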

sbly

So if my understanding is correct, warps are an implementation detail on (most/some/all?) GPUs, but there is nothing in CUDA about them. That is, our CUDA code may be very inefficient in that it is not data-parallel, i.e., unable to execute the same instruction on multiple (up to 32) different data streams. This is different from using ISPC and vector intrinsics, where all the performance gain comes from taking advantage of multiple ALUs. In CUDA, the focus seems to be on gaining performance from using multiple cores (although there are also many ALUs per core). This makes sense, as GPUs have many more cores than CPUs, so this is the main source of parallelism. Is that correct?

kayvonf

@sbly: in many situations the warp concept IS just an implementation detail, and this was essentially the case when CUDA first came out. But technically, the concept of a warp is exposed in modern incarnations of the CUDA programming model. Take a look at my comment on slide 52 about how warp-based execution (largely an implementation decision) bleeds into the modern CUDA programming model.
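
For example (an illustration of what "exposed" means here, not code from the slide), newer CUDA versions provide warp-level intrinsics such as __shfl_down_sync, which move values directly between the lanes of a warp:

    // Illustration only: a warp-wide sum using shuffle intrinsics. Each call
    // exchanges registers between lanes of the same warp, so the warp is
    // visible in the programming model here, not just in the hardware.
    __inline__ __device__ float warpReduceSum(float val) {
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down_sync(0xffffffff, val, offset);  // all 32 lanes participate
        return val;   // lane 0 ends up with the sum of the warp's 32 values
    }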

kkz

@kayvonf Do we suffer any inefficiencies when our block size is not a multiple of 32? I'm assuming each warp cannot serve more than one block (it wouldn't make much sense otherwise).

kayvonf

@kkz: Yes. There would be one warp that was only "partially full" with active threads. This is a case of execution divergence, since some lanes of the warp would essentially always be masked off.
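
A quick sketch of the arithmetic, with a made-up block size of 100 threads:

    #include <cstdio>

    // Hypothetical kernel, just so there is something to launch.
    __global__ void scale(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main() {
        const int threadsPerBlock = 100;                        // not a multiple of 32
        const int warpsPerBlock = (threadsPerBlock + 31) / 32;  // warps come in units of 32 -> 4
        // The 4th warp has only 100 - 3*32 = 4 active lanes; its other 28 lanes
        // are masked off for the whole kernel, wasting SIMD execution capacity.
        printf("warps per block: %d\n", warpsPerBlock);

        float* d_data;
        cudaMalloc(&d_data, threadsPerBlock * sizeof(float));
        scale<<<1, threadsPerBlock>>>(d_data);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }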

uhkiv

So from the discussion, I understand that a warp can support 32 threads because of SIMD. These 32 threads must have their own stacks (from what I understand from previous slides), and I'm assuming those live with the warp. I can see that each warp can hold about 256 KB, so does that mean each thread gets about 8 KB worth of stack memory? Correct me if I'm wrong.

yihuaz

So from what I understand now, a CUDA thread is executed on one SIMD lane. A warp supports 32 CUDA threads. Does that mean the SIMD unit is 32-wide for this specific GPU?

kayvonf

@yihuaz: Yes.
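
For what it's worth, you can confirm this on a given GPU with a standard runtime query (a small sketch):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);          // device 0
        printf("warp size: %d\n", prop.warpSize);   // 32 on current NVIDIA GPUs
        return 0;
    }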