chenh1

If I just have one core capable of scheduling up to 64 warps, and I launch 16 blocks (each block mapping to 8 warps), will this work?

kayvonf

@chenh1. Your answer lies in this example. Why don't you give it a shot?

slowloris

If I understand correctly, a warp is just the name CUDA gives to an execution context.

mattlkf

To clarify: the fact that 32 CUDA threads in a warp are executed using SIMD means that even if their control flow diverges, two branches of control flow are not executed simultaneously, right?

For example, if the threads in a warp are executing this fragment of code:

if (threadIdx.x % 2 == 0) { ... do X ... } else { ... do Y ... }

then X and Y will not run at the same time? Instead, half the threads will do X while the other half waits, and then the other half will do Y while the first half waits?
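
In pseudocode, I imagine the hardware doing something like this at runtime (just my mental model; the per-lane "mask" bookkeeping below is hypothetical, not actual hardware instructions):

mask = (threadIdx.x % 2 == 0) evaluated for each of the 32 lanes   // true for even lanes 0, 2, 4, ...
run X with only the lanes set in mask enabled                      // odd lanes sit idle
run Y with only the lanes cleared in mask enabled                  // even lanes sit idle
reconverge: all 32 lanes enabled again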


What I described seems consistent with the SIMD execution mentioned on the slide (and with the statement that "performance can suffer due to divergent execution").

But this implementation of thread execution seems at odds with the programming model, which does nothing to discourage divergent control flow between threads. Furthermore, I'm curious how divergent control flow gets "serialized" into a stream of SIMD instructions even in the presence of many branches. (And does the "serialization" happen at compile time or at runtime?) That's why I'm posting this :P
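
In case it helps anyone experiment, here is a minimal, self-contained version of the fragment above that should compile with nvcc (the 2 * i and 3 * i bodies and the launch size are placeholders I made up for "do X" and "do Y"):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent_kernel(int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        out[i] = 2 * i;  // "do X": only the even lanes of each warp are active here
    } else {
        out[i] = 3 * i;  // "do Y": only the odd lanes of each warp are active here
    }
}

int main() {
    const int N = 64;                   // two warps' worth of threads
    int h_out[N];
    int* d_out;
    cudaMalloc((void**)&d_out, N * sizeof(int));
    divergent_kernel<<<1, N>>>(d_out);  // one block of 64 threads = 2 warps
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d ", h_out[i]);
    printf("\n");
    cudaFree(d_out);
    return 0;
}

Running one block of 64 threads gives two warps; within each warp the even and odd lanes take different branches, so (if my understanding is right) each warp executes the two branch bodies one after the other rather than simultaneously.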