kayvonf

Question: Now that we've had experience in class with CUDA, how would you answer these questions?

Kapteyn

Data-parallelism

CUDA is a good example of a data-parallel programming model because in CUDA we launch many CUDA threads that all execute the same kernel function. Programs that follow the SPMD model are ideal for running on a GPU because each GPU core contains wide SIMD execution units, which are used most efficiently when the same instruction is applied to many pieces of data. This is also why GPUs are well suited to rendering images: we typically perform the same computation on every pixel in a grid of pixels.
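For concreteness, here is a minimal sketch of a kernel in that style (the name brighten and the launch shape are made up for illustration): every thread runs the same function, but on its own pixel.

```cuda
// Each CUDA thread executes this same function on a different pixel (SPMD).
__global__ void brighten(float* image, int width, int height, float amount) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        image[y * width + x] += amount;  // same instruction, different data
}

// Host side: one thread per pixel, e.g.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(deviceImage, width, height, 0.1f);
```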


Shared address space vs message passing

CUDA code that runs on the GPU exhibits the shared address space model. In fact, there are two levels of shared address space in CUDA: threads in a block can share per-block (__shared__) memory on a single core, and all threads launched in a kernel share the GPU's global memory.
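A small sketch of the two levels (names and sizes are illustrative): the __shared__ array is visible only within one block, while the kernel's pointer arguments refer to global memory visible to every thread.

```cuda
// Assumes blockDim.x == 256. `partial` is per-block shared memory;
// `data` and `blockSums` live in global memory shared by all threads.
__global__ void blockSum(const float* data, float* blockSums, int n) {
    __shared__ float partial[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? data[i] : 0.0f;
    __syncthreads();                       // barrier over this block only
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            sum += partial[j];
        blockSums[blockIdx.x] = sum;       // written to global memory
    }
}
```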

The transfer of data between host and device code exhibits the message passing model: one must explicitly copy data between the CPU and the GPU (e.g., with cudaMemcpy), transferring it over the PCIe bus, which can be a significant bottleneck in a program.
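The host-side pattern looks roughly like this (a sketch; scaleKernel is a hypothetical kernel and error checking is omitted):

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n);  // hypothetical kernel

void runOnGpu(float* hostData, int n) {
    float* deviceData = nullptr;
    cudaMalloc(&deviceData, n * sizeof(float));
    // Host -> device "message": the bytes cross the PCIe bus.
    cudaMemcpy(deviceData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleKernel<<<(n + 255) / 256, 256>>>(deviceData, n);
    // Device -> host "message": results come back the same way.
    cudaMemcpy(hostData, deviceData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(deviceData);
}
```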


Analogies between ISPC and CUDA

An ISPC instance is analogous to a CUDA thread and a gang of ISPC instances is analogous to a warp.

In fact, based on our knowledge of ISPC's implementation, we know that the number of instances in a gang is typically determined by the hardware: it matches the SIMD vector width of the CPU. Similarly, the number of CUDA threads per warp is determined by the SIMD width of the GPU (though if the number of threads per block is not a multiple of the warp size, the last warp of each block is only partially filled).
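You can see this grouping directly in device code (a sketch; recordLanes is a made-up name): warpSize is a built-in variable, 32 on current NVIDIA GPUs, and the lane index plays roughly the role of ISPC's programIndex.

```cuda
// Records each thread's position within its warp (0..warpSize-1).
__global__ void recordLanes(int* lanes) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    lanes[i] = threadIdx.x % warpSize;  // analogous to programIndex in a gang
}
```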

Launching multiple ISPC tasks is similar to a kernel launch. When you launch multiple ISPC tasks you create multiple gangs, and each gang can run on a separate core: tasks provide parallelism across cores, while instances provide parallelism across SIMD lanes. Just as you dictate the number of thread blocks launched in CUDA, you tell ISPC how many tasks to launch, and all instances in a task's gang run on the same core, just as all threads in a CUDA thread block run on the same core. Similarly, just as a CUDA program must launch many thread blocks so that many warps execute concurrently across the GPU's cores, an ISPC program must launch multiple tasks to make use of all the CPU's cores. Finally, just as there are no guarantees about the order in which CUDA thread blocks execute, there are no guarantees about the order in which ISPC tasks execute.
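As a sketch of the analogy (names are illustrative): each block processes its own chunk of the input independently, just as each ISPC task would, and the grid size plays the role of the count in launch[numTasks].

```cuda
// blockIdx.x plays roughly the role of ISPC's taskIndex; no ordering of
// blocks (or tasks) is guaranteed, so chunks must be independent.
__global__ void processChunk(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Host side, analogous to `launch[numBlocks] task(...)` in ISPC:
// processChunk<<<numBlocks, 128>>>(deviceData, n);
```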

Another similarity between ISPC and CUDA is that program execution continues asynchronously after calling launch in ISPC, just as it continues asynchronously after launching a kernel in CUDA. Thus in both cases, if you wish to use the values computed by the kernel or by the launched tasks, you must call sync (in ISPC) or cudaDeviceSynchronize (in CUDA). (Note, however, that a plain call to an ISPC function, which executes a single gang without launch, is synchronous.)
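In CUDA that looks like this (a sketch reusing the hypothetical processChunk above; doOtherCpuWork is also hypothetical):

```cuda
processChunk<<<numBlocks, 128>>>(deviceData, n);  // returns immediately
doOtherCpuWork();                 // CPU work overlaps with GPU execution
cudaDeviceSynchronize();          // like ISPC's sync: wait for the kernel
// (A cudaMemcpy on the default stream would also implicitly wait.)
```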

landuo

CUDA is a data-parallel programming model because CUDA applies the same function to multiple pieces of data. The execution of a CUDA program itself follows a shared address space model: as mentioned in class, a GPU running a CUDA program has shared global memory, and each core has its own cache. As @Kapteyn said, copying data from host to device and from device to host is an example of the message passing model. A gang of ISPC instances is similar to a warp in CUDA, and launching ISPC tasks is similar to how the GPU schedules thread blocks for execution.