Slide 34 of 54
meatie

Can the host access CUDA memory? In other words, if there is an array allocated in device memory, can the host read the array's elements? Will it be a segmentation fault if the host tries to read device memory?

srw

@meatie No, it cannot. The host cannot directly manipulate the contents of the device memory. See the comment at the bottom of lecture 5, slide 31.

As far as I can tell, GPU memory is physically separate from host memory and normally inaccessible to the CPU. This is why cudaMemcpy is compared to message passing.
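A minimal sketch of what this means in practice: the pointer returned by cudaMalloc refers to device memory and cannot be dereferenced on the host; to inspect the data, the host copies it back with cudaMemcpy. (cudaMalloc, cudaMemset, and cudaMemcpy are standard CUDA runtime calls; the array size and names here are illustrative.)

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int N = 16;
    float* deviceA;
    cudaMalloc(&deviceA, N * sizeof(float));    // allocation lives in device memory
    cudaMemset(deviceA, 0, N * sizeof(float));  // device-side initialization

    // Dereferencing deviceA on the host (e.g., deviceA[0]) is invalid:
    // the pointer refers to the GPU's address space, not the host's.

    // To inspect the data, first copy it back into host memory.
    float hostA[N];
    cudaMemcpy(hostA, deviceA, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("first element: %f\n", hostA[0]);

    cudaFree(deviceA);
    return 0;
}
```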

sgbowen

Is it possible to use cudaMemcpy to copy from global memory to shared memory? Does that make any sense? And is this faster/slower than copying it cooperatively using threads?

kayvonf

@sgbowen. There is no built-in global-to-shared memcpy primitive in CUDA, but it would be a useful helper library function. If there were, its implementation on current GPUs would almost certainly be the cooperative load implemented here.
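To illustrate, a sketch of what such a helper might look like (the function name is hypothetical; CUDA does not provide it): every thread in the block copies a strided subset of the elements, followed by a barrier.

```cpp
// Hypothetical helper: cooperatively copy `count` floats from global
// memory into shared memory using all threads of the block.
__device__ void blockCopyGlobalToShared(float* sharedDst,
                                        const float* globalSrc,
                                        int count) {
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedDst[i] = globalSrc[i];
    __syncthreads();   // make the copy visible to every thread in the block
}
```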

jezimmer

What is the difference between using cudaMalloc and then initializing the memory versus initializing the memory and then using cudaMemcpy?
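For concreteness, here is a sketch of the two approaches I mean (the function and variable names are just illustrative): one fills the array on the device after cudaMalloc, the other fills a host array and then transfers it with cudaMemcpy.

```cpp
#include <cuda_runtime.h>

// Approach 1: allocate with cudaMalloc, then initialize on the device via a kernel.
__global__ void fillWithIndex(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = (float)i;
}

void initOnDevice(float* deviceA, int n) {
    fillWithIndex<<<(n + 127) / 128, 128>>>(deviceA, n);
}

// Approach 2: initialize a host array, then transfer it to the device.
void initOnHostThenCopy(float* deviceA, const float* hostA, int n) {
    cudaMemcpy(deviceA, hostA, n * sizeof(float), cudaMemcpyHostToDevice);
}
```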

rbandlam

Can someone please explain what the first red block of code is doing? I mean the red block that cooperatively loads the block's support region. Why is there an if condition, and what are we doing inside it?

kayvonf

@rbandlam. The 128 threads in the block are cooperatively loading the 130 required input elements. Each thread loads one element. Then two of the threads load the final 2 elements.
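A reconstruction of the pattern on the slide (assuming THREADS_PER_BLK is 128 and a 3-wide averaging filter; details may differ slightly from the slide's exact code):

```cpp
#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK + 2];   // 130-element support region
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperative load: each of the 128 threads loads one element of the
    // support region...
    support[threadIdx.x] = input[index];
    // ...and the first two threads also load the final two elements; that is
    // what the if condition is for.
    if (threadIdx.x < 2)
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    __syncthreads();   // wait until all 130 elements are in shared memory

    // Each thread then computes one output element from three adjacent inputs.
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```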

Sherry

Maybe it's better to remove the first parameter (int N) from the function 'convolve'. The parameter is not used, and it confused me a little at first glance, since a kernel function is supposed to process one piece of data rather than a range (which 'int N' suggests).

squashme

What's the point of the shared memory? Is it faster to copy the data into shared memory and then use it there than to just access it from regular global GPU memory?

plymouth

@squashme If you're just doing one read and one store, then no, it's not faster. But in this example, each thread does three reads of the input (support) array. Since shared memory has roughly 100x lower latency than global GPU memory, staging the data in shared memory drastically reduces the cost of the second and third reads.

http://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
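For contrast, a sketch of the same convolution written without shared memory: each thread issues three global-memory loads instead of reusing data its block already staged on-chip (names are illustrative).

```cpp
__global__ void convolveNoShared(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];   // three reads from global memory per thread
    output[index] = result / 3.f;
}
```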