Slide 49 of 81
khans

We use the first two threads to load the two extra elements that are needed by the last threads in the block.
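
For reference, the kernel being discussed looks roughly like this (a sketch reconstructed from the identifiers mentioned in this thread, such as `support` and `THREADS_PER_BLK`; the actual slide code may differ in details):

```cuda
#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
    // shared across the 128 threads of this block: 128 elements + 2 halo elements
    __shared__ float support[THREADS_PER_BLK + 2];

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // every thread loads one element; the first two threads also load the
    // two extra ("halo") elements needed by the last threads of the block
    support[threadIdx.x] = input[index];
    if (threadIdx.x < 2)
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];

    // wait until all of support has been filled
    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += support[threadIdx.x + i];
    output[index] = result / 3.f;
}
```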

planteurJMTLG

Can we do better than __syncthreads() with not too much communication overhead? Something like "only wait until support[threadIdx.x+1] and support[threadIdx.x+2] have been filled", instead of "wait until support has been completely filled"?

It would be especially useful if not all of the threads can run concurrently.
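
There is no built-in "wait only for those two specific elements" primitive, but CUDA 9+ does offer synchronization narrower than a full block. One hedged sketch (my own restructuring, not the slide's code): give each warp its own 34-element slice of `support` so only that warp's 32 threads need to agree, and use `__syncwarp()` instead of `__syncthreads()`:

```cuda
#define THREADS_PER_BLK 128
#define WARP_SIZE 32

__global__ void convolve_warp_sync(int N, float* input, float* output) {
    // one 34-element slice (32 outputs + 2 halo elements) per warp
    __shared__ float support[(THREADS_PER_BLK / WARP_SIZE) * (WARP_SIZE + 2)];

    int warpId = threadIdx.x / WARP_SIZE;
    int lane   = threadIdx.x % WARP_SIZE;
    int index  = blockIdx.x * blockDim.x + threadIdx.x;

    // this warp's private slice of shared memory
    float* mySupport = &support[warpId * (WARP_SIZE + 2)];

    mySupport[lane] = input[index];
    if (lane < 2)
        mySupport[WARP_SIZE + lane] = input[index + WARP_SIZE];

    __syncwarp();   // wait only for the 32 threads of this warp, not the whole block

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += mySupport[lane + i];
    output[index] = result / 3.f;
}
```

This trades a little extra shared memory (duplicated halo elements) for a cheaper synchronization; whether it is actually faster depends on the hardware.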

jocelynh

@planteurJMTLG's question brings another one to mind:

Is it possible for there to be a thread block with enough threads such that they cannot all be run concurrently on one core? And if that happens, how would __syncthreads() work, if at all?

kayvonf

@jocelynh -- Good question. Let's have that discussion on slide 75. And you might want to take a look at what's going on with the "persistent CUDA threads" example on slide 79.

paramecinm

If we use the implementation from the last slide, each thread performs three memory loads, and since the loads dominate, the running time of the block is approximately three memory loads. With this slide's implementation, the first two threads each perform two memory loads while the rest perform one, and because the whole block of threads is synchronized, the running time is about two memory loads. So we save roughly 1/3 of the running time. And if the task were to convolve two elements rather than three, the implementation on this slide wouldn't save time (though it would still reduce memory bandwidth consumption).
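
For comparison, the version from the previous slide (a sketch, as I recall it, not the exact code) has every thread read its three inputs directly from global memory, which is where the "three memory loads" figure comes from:

```cuda
__global__ void convolve(int N, float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];   // three loads from global memory per thread

    output[index] = result / 3.f;
}
```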

nemo

Good point @paramecinm! Wouldn't it be better performance-wise then to compute 126 indices and load 128 values in one block?

shiyqw

This example shows the improvement from using the shared address space. I am wondering whether there are methods to improve performance by using private (per-thread) memory?

RomanArena

I think this is a good example of the content on slide 46: we know that the threads may access the same variable, so we load the variable from global memory into shared memory.

vadtani

@nemo I believe the code is written so that one block computes THREADS_PER_BLK (128 in this example) indices. If we compute only 126, how do you plan to accommodate the remaining 2 indices? Please correct me if I misunderstood your point.

nemo

@vadtani - From my understanding you can load 128 values, which are enough for 126 convolutions, given that each convolution uses 3 elements. You can use the thread IDs to control this.
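
A hedged sketch of what that variant might look like (my own illustration, reusing the names from the slide): every thread does exactly one load, and only the first 126 threads produce an output, so each block covers 126 output indices.

```cuda
#define THREADS_PER_BLK 128

__global__ void convolve_126(int N, float* input, float* output) {
    __shared__ float support[THREADS_PER_BLK];

    // each block produces THREADS_PER_BLK - 2 = 126 outputs,
    // so blocks are spaced 126 output indices apart
    int index = blockIdx.x * (THREADS_PER_BLK - 2) + threadIdx.x;

    support[threadIdx.x] = input[index];   // exactly one load per thread, no special cases

    __syncthreads();

    if (threadIdx.x < THREADS_PER_BLK - 2) {   // the last two threads only helped load
        float result = 0.0f;
        for (int i = 0; i < 3; i++)
            result += support[threadIdx.x + i];
        output[index] = result / 3.f;
    }
}
```

As @vadtani notes, the grid then needs roughly N/126 blocks instead of N/128 to cover all output indices.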

manishj

@nemo I agree that in this example, 126 indices should give better performance. Here, 2 threads have to execute 2 load instructions, and the other threads need to wait for them before proceeding because of the __syncthreads() call. However, if the part of the code after __syncthreads() were very computationally intensive, then 128 indices would be a better choice.

sandeep6189

The slide says that the barrier here only synchronizes the threads within a block. How can we do the same for all threads across all the blocks?

kayvonf

@sandeep6189. A barrier primitive across all threads in all thread blocks does not exist. Why? (hint: slide 75)

Cake

We are not guaranteed that all blocks are running in parallel. They could be scheduled one after another, for example.

sandeep6189

@kayvonf I was wondering about the case where my program is sequential and I would like to run it on the GPU (for example, to harness the power of GPU compute when my CPU is slow). Having such a barrier could help me write such a program in CUDA.

kayvonf

I agree with @Cake's answer. Having such a primitive would highly constrain CUDA's implementation, and it is not consistent with the stream programming model CUDA adopts across its thread blocks. Thus it is not allowed.
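
The usual way to get the effect of a device-wide barrier is to end the kernel and launch another one: all blocks of the first launch complete before any block of the second launch starts on the same stream. A minimal sketch, with hypothetical kernel names:

```cuda
__global__ void phase1(float* data, int N) { /* ... work before the "barrier" ... */ }
__global__ void phase2(float* data, int N) { /* ... sees all of phase1's results ... */ }

void run_two_phases(float* d_data, int N) {
    int blocks = (N + 127) / 128;

    phase1<<<blocks, 128>>>(d_data, N);
    // the launch boundary acts as the global synchronization point:
    // kernels on the same stream run in order, so phase2 starts only
    // after every block of phase1 has finished
    phase2<<<blocks, 128>>>(d_data, N);

    cudaDeviceSynchronize();   // have the CPU wait for everything to complete
}
```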