pwei

So at least to me, the device code is incredibly difficult to decipher. We start with some input array, and we make a shared array (so that other threads can pull from what the current thread is responsible for putting into the array) in order to copy over only exactly what we want. I believe this would be faster since the data is now in per-block shared memory rather than in device global memory (lecture 5, slide 40). After __syncthreads, since all the data is in support, each thread can run its individual computation and then put its result into output. The host code can then wait for all of the threads to finish.
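For concreteness, here is a minimal sketch of what a shared-memory convolve kernel along these lines might look like (the block size, the 3-wide moving average, and the variable names are assumptions on my part, not necessarily the exact code on the slide):

```cuda
#define THREADS_PER_BLK 128

// Each CUDA thread cooperatively loads one element (plus a small halo)
// into per-block shared memory, waits at the barrier, then computes its
// own output element from the shared copy. Assumes input has N+2 elements.
__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];   // per-block shared array
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2)   // first two threads also load the halo elements
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    __syncthreads();       // barrier: all loads complete before any thread reads

    float result = 0.0f;   // 3-element moving average
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.0f;
}
```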

kayvonf

A CUDA kernel function, like convolve shown above, is a function that is executed in SPMD fashion by multiple CUDA threads. Together these threads are called a CUDA thread block, and one way threads in a block can communicate is by reading and writing values in per-block shared memory (support is a shared memory array in this example). Like an ISPC gang, all threads in a CUDA thread block are executed concurrently, although there is no guarantee of synchronization after each instruction as there is in ISPC (also see the discussion here). Synchronizing threads in a CUDA thread block requires explicit use of a barrier. You see this in the call to __syncthreads in the middle of the convolve function.
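As a hedged sketch of the host side (array sizes, the memory-management calls, and the synchronization call here are illustrative assumptions, building on the kernel sketch above), the thread blocks get created by the kernel launch:

```cuda
#include <cuda_runtime.h>

int main() {
    // Launch one thread block of THREADS_PER_BLK threads per chunk of output.
    // Every thread in every block runs the same convolve function (SPMD).
    int N = 1024 * 1024;
    float *devInput, *devOutput;
    cudaMalloc((void**)&devInput, sizeof(float) * (N + 2));   // +2 for the halo
    cudaMalloc((void**)&devOutput, sizeof(float) * N);
    // ... fill devInput with cudaMemcpy ...
    convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);
    cudaDeviceSynchronize();   // host blocks here until all thread blocks finish

    cudaFree(devInput);
    cudaFree(devOutput);
    return 0;
}
```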

If someone wants to really check their understanding, I'd like to hear your thoughts comparing the concepts of an ISPC gang, a CUDA thread block, and an NVIDIA GPU warp. I'd also like to hear your thoughts on ISPC uniform values vs. CUDA per-block shared memory values.

benchoi

Let me check: a warp is analogous to an ISPC gang, since the threads in each warp are executed in SIMD lockstep by the 32 ALUs in a single core. A block is an abstraction to make it convenient for the programmer, containing a set of threads which may share data, and the hardware can break down a block into multiple warps so that they can be executed concurrently for latency hiding.

ISPC uniform values are analogous to CUDA per-block shared memory values. However, because instances in an ISPC gang are guaranteed to execute in lockstep (unlike threads in a CUDA block), race conditions can more easily occur when updating shared memory in CUDA, and the use of barriers can help prevent this.
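A toy illustration of that last point (my own example, not from the lecture): without the barrier, the read of a neighbor's shared-memory slot races with that neighbor's write, because the two threads may sit in different warps and are not in lockstep.

```cuda
#define BLOCK_SIZE 256

__global__ void neighbor_sum(const float* in, float* out) {
    __shared__ float buf[BLOCK_SIZE];
    int i = threadIdx.x;
    int g = blockIdx.x * blockDim.x + i;
    buf[i] = in[g];      // each thread writes one shared-memory slot

    __syncthreads();     // remove this and buf[i + 1] may not be written yet

    if (i < blockDim.x - 1)
        out[g] = buf[i] + buf[i + 1];   // reads a value written by a neighbor
}
```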

idl

@benchoi @kayvonf are the threads in each warp necessarily executed in SIMD lockstep by the ALUs? I was under the impression that yes they are executed concurrently, but not necessarily in lockstep.

EDIT: Oops I misunderstood, nevermind!

shabnam

Can someone explain "SPMD fashion by multiple CUDA threads"?

arjunh

@shabnam: I guess you could draw an analogy between CUDA and ISPC to understand this phrase; in ISPC, we essentially ran a program on a set of data elements in parallel using program instances. This is the notion of SPMD (single program, multiple data).

You can draw a similar picture with CUDA threads and kernels: when a kernel is launched, a bunch of CUDA threads are created, and each CUDA thread executes the kernel function.
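To make that concrete, here is a minimal (hypothetical) kernel: every thread executes the same function, but derives a unique index from blockIdx/threadIdx, much like an ISPC program instance uses programIndex to pick its data element.

```cuda
__global__ void scale(int N, float a, float* x) {
    // Same program for every thread; the data each thread touches differs.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                  // guard the partially filled last block
        x[idx] = a * x[idx];
}

// Hypothetical launch: enough 128-thread blocks to cover all N elements.
// scale<<<(N + 127) / 128, 128>>>(N, 2.0f, devX);
```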