Original writeup by sfackler from the spring 2013 course.
Devices with compute capability 2.x or higher support calls to printf from within a CUDA kernel. (You must be using CUDA version 3.1 or higher.) Here's a small example:
#include <stdio.h>

__global__ void print_kernel() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    print_kernel<<<10, 10>>>();
    cudaDeviceSynchronize();
}
You'll need to tell nvcc, via the -arch flag, that you're compiling for compute capability 2.0, or the program will fail to compile:
nvcc -arch compute_20 printf.cu
An important thing to note is that every CUDA thread will call printf. In this example, we'll see 100 lines of output! The order in which the blocks print is not guaranteed.
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3
Hello from block 1, thread 4
Hello from block 1, thread 5
....
Hello from block 8, thread 3
Hello from block 8, thread 4
Hello from block 8, thread 5
Hello from block 8, thread 6
Hello from block 8, thread 7
Hello from block 8, thread 8
Hello from block 8, thread 9
It's generally a good idea to limit the number of threads calling printf to avoid getting spammed:
if (threadIdx.x == 0) {
    printf(...);
}
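As a fuller illustration of the guard above, here is a minimal sketch in which only the first thread of each block prints. The names leader_print_kernel and launch_leader_print are hypothetical, chosen for this example only.

```cuda
#include <stdio.h>

// Only thread 0 of each block prints, so a launch with N blocks
// produces N lines instead of N * blockDim.x lines.
__global__ void leader_print_kernel() {
    if (threadIdx.x == 0) {
        printf("Hello from block %d (%d threads)\n", blockIdx.x, blockDim.x);
    }
}

// Hypothetical host helper: launch the kernel, then synchronize so the
// device-side printf buffer is flushed before we return.
cudaError_t launch_leader_print(int blocks, int threads_per_block) {
    leader_print_kernel<<<blocks, threads_per_block>>>();
    cudaError_t launch_err = cudaGetLastError();  // reports a bad launch configuration
    if (launch_err != cudaSuccess) {
        return launch_err;
    }
    return cudaDeviceSynchronize();  // also flushes the printf output
}

int main() {
    cudaError_t err = launch_leader_print(10, 10);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With the same 10 x 10 launch as before, this should print one line per block rather than one per thread. The blocks may still appear in any order.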
Some important notes:
- printf output is stored in a circular buffer of a fixed size. If the buffer fills, old output will be overwritten. The buffer's size defaults to 1MB and can be configured with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_t size).
- This buffer is flushed only for:
  - the start of a kernel launch
  - synchronization (e.g. cudaDeviceSynchronize())
  - blocking memory copies (e.g. cudaMemcpy(...))
  - module load/unload
  - context destruction
An important thing to note is that this list does not include program exit. If the call to cudaDeviceSynchronize() were removed from the example program above, we would see no output.
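The two notes above can be combined into one rough sketch: enlarge the printf buffer before launching a chatty kernel, then synchronize so the buffer is flushed before the program exits. The helper set_printf_buffer, the kernel chatty_kernel, and the 8MB figure are assumptions made for illustration. The limit is set before the launch here, on the assumption that the buffer size is fixed once a kernel is running.

```cuda
#include <stdio.h>

__global__ void chatty_kernel() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Hypothetical helper: request a printf buffer of `bytes` bytes and return
// the size the runtime reports afterwards (0 if either call fails).
size_t set_printf_buffer(size_t bytes) {
    if (cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes) != cudaSuccess) {
        return 0;
    }
    size_t actual = 0;
    if (cudaDeviceGetLimit(&actual, cudaLimitPrintfFifoSize) != cudaSuccess) {
        return 0;
    }
    return actual;
}

int main() {
    // Set the limit before launching any kernel that calls printf.
    size_t actual = set_printf_buffer((size_t)8 * 1024 * 1024);
    fprintf(stderr, "printf buffer size: %zu bytes\n", actual);

    chatty_kernel<<<64, 64>>>();

    // Program exit does not flush the buffer, so synchronize explicitly.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Reading the limit back with cudaDeviceGetLimit is optional. It is included only to confirm that the request took effect.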