Slide 38 of 81
kayvonf

Question: compare/contrast the following:

  • pthread
  • CUDA thread
  • ISPC program instance
poptarts

@kayvonf

  1. In general, POSIX threads (the pthread API) are a set of interfaces used for general-purpose threaded programming. Different operating systems implement pthreads differently, but their purpose is the same: to create a logical unit of work that the OS can schedule to run concurrently on the machine's processors. Another way to think about a pthread is as a single instruction stream that can share (heap) memory with other threads and is scheduled by the OS onto a processor's execution context.
  2. A CUDA thread, conceptually, is similar to a pthread in that it also represents a logical unit of work. In the CUDA programming model, functions called kernels are executed concurrently on an (NVIDIA) GPU by CUDA threads. Unlike pthreads, a single CUDA thread rarely possesses its own unique instruction stream. Instead, a collection, or block, of CUDA threads shares an instruction stream (the compiled kernel) and is executed concurrently on separate data using the GPU's SIMD units. More specifically, a warp of 32 CUDA threads sharing an instruction stream is executed simultaneously on a GPU core capable of 32-wide SIMD execution. So while a pthread is eventually mapped to an execution context on a CPU core, a CUDA thread is eventually mapped to a SIMD lane on the GPU.
  3. An ISPC program instance is again an abstraction around a unit of work that may be done concurrently with other units of work. Similar to CUDA threads, a program instance rarely exists alone; instances are grouped into gangs. Just as a CUDA warp is mapped onto a GPU core's SIMD lanes, an ISPC gang is eventually mapped onto the SIMD lanes of the CPU and executed concurrently.

So, all of these constructs are useful abstractions that represent units of work that can be done concurrently, and the details of how they are done concurrently vary (CPU core vs. GPU SIMD lane vs. CPU SIMD lane).