How to Print from CUDA : Parallel Computer Architecture and Programming : 15-418/618 Fall 2017

Original writeup by sfackler from the spring 2013 course.

Devices with compute capability 2.x or higher support calls to printf from within a CUDA kernel.¹ (You must be using CUDA version 3.1 or higher.) Here's a small example:

#include <stdio.h>

__global__ void print_kernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    print_kernel<<<10, 10>>>();
    cudaDeviceSynchronize();
}

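On older CUDA toolkits, whose default target is below compute capability 2.0, you'll need to tell nvcc to compile for compute capability 2.0 or higher with the -arch flag, or the program will fail to compile. CUDA 9 and later no longer accept compute_20, so substitute an architecture your device supports there. With an older toolkit: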

nvcc -arch compute_20 printf.cu

An important thing to note is that every CUDA thread will call printf. In this example, we'll see 100 lines of output!

Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3
Hello from block 1, thread 4
Hello from block 1, thread 5
....
Hello from block 8, thread 3
Hello from block 8, thread 4
Hello from block 8, thread 5
Hello from block 8, thread 6
Hello from block 8, thread 7
Hello from block 8, thread 8
Hello from block 8, thread 9

It's generally a good idea to limit the number of threads calling printf to avoid getting spammed, for example by printing from only the first thread of each block:

if (threadIdx.x == 0) {
    printf(...);
}
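
The guard above still prints once per block. Here is a minimal sketch of narrowing it further to a single thread in the whole grid, assuming a one-dimensional launch like the earlier example. The kernel name guarded_print_kernel and the messages are made up for illustration.

```cuda
#include <stdio.h>

// Hypothetical kernel name, used only for this illustration.
__global__ void guarded_print_kernel() {
    // One line per block: only the first thread of each block prints.
    if (threadIdx.x == 0) {
        printf("Block %u checking in\n", blockIdx.x);
    }

    // One line for the whole grid: only thread 0 of block 0 prints.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        printf("Grid has %u blocks of %u threads\n", gridDim.x, blockDim.x);
    }
}

int main() {
    guarded_print_kernel<<<10, 10>>>();
    // Wait for the kernel to finish so the buffered output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```

With a <<<10, 10>>> launch this should print eleven lines in total: ten from the per-block guard and one from the grid-wide guard. The order of the lines is not guaranteed.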

Some important notes:

printf output is stored in a circular buffer of a fixed size. If the buffer fills, old output will be overwritten. The buffer's size defaults to 1MB and can be configured with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_t size).
This buffer is flushed only at the following points:
- the start of a kernel launch
- synchronization (e.g. cudaDeviceSynchronize())
- blocking memory copies (e.g. cudaMemcpy(...))
- module load/unload
- context destruction
An important thing to note is that this list does not include program exit. If the call to cudaDeviceSynchronize() were removed from the example program above, we would see no output.
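
The notes above can be combined into a short sketch: enlarge the printf buffer before launching, then synchronize so the output is flushed before the program exits. The 8 MB size, the kernel name noisy_kernel, and the CHECK macro are arbitrary choices for illustration, not part of the original writeup.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Hypothetical kernel name; every thread prints one line.
__global__ void noisy_kernel() {
    printf("Hello from block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Assumed size for illustration: 8 MB instead of the 1 MB default.
    // Set the limit before launching any kernel that uses printf.
    size_t new_size = 8 * 1024 * 1024;
    CHECK(cudaDeviceSetLimit(cudaLimitPrintfFifoSize, new_size));

    // Read the limit back to see what the runtime actually applied.
    size_t actual_size = 0;
    CHECK(cudaDeviceGetLimit(&actual_size, cudaLimitPrintfFifoSize));
    printf("printf buffer size: %zu bytes\n", actual_size);

    noisy_kernel<<<10, 10>>>();
    CHECK(cudaGetLastError());  // catches errors from the launch itself

    // Synchronization is one of the flush points listed above; without it
    // the program could exit before any device output appears.
    CHECK(cudaDeviceSynchronize());
    return 0;
}
```

Checking the return value of cudaDeviceSynchronize() is optional for printing, but it also surfaces errors that occurred while the kernel was running.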

¹ See the official documentation.