Slide 33 of 56
aaguilet

The problem here is that we lock i * j * k times per thread, where 'i' is the number of rows a thread has to process, 'j' is the number of red cells in each row, and 'k' is the number of iterations it takes to converge.

A better approach (details on slide 35) is to accumulate a partial difference inside the nested for loops, and only afterwards add it to the shared 'diff' variable. In the notation above, each thread then locks only 'k' times.

pradeep

I had a small question regarding the use of barriers. Maybe I am confusing CUDA threads with tasks, but in Assignment 1 all the tasks ran the same instructions, and when you didn't want an instruction to affect you, you just masked out your lane. Since CUDA also follows SIMD, my question is: won't all the CUDA threads exit the for loop at the same time if they are all running the same instructions, removing the need for the 2nd barrier?

kayvonf

Who can answer Pradeep's question? (I like Pradeep's question.) Another way to think about this question is: do all CUDA threads in a thread block execute in SIMD lockstep?

pradeep

Yeah, I think I get it now. Just to lay out what I am thinking: suppose we have 256 CUDA threads per block; we would need 8 warps to run them. In the NVIDIA chip described in class, only 4 warps (128 threads) ran at any time, so the 128 threads in the first 4 warps would need to wait for the other 128 to run and finish executing after them. Is this the reason barriers are used?

ycp

@pradeep

I think you're right (unless you're saying the first 128 threads run first and then the next 128 threads run). Think of an example like this:

If the processor decided to schedule the threads such that the first 128 ran to completion first and then the next 128 threads were run, then a barrier would cause deadlock. The first set of threads would wait at the barrier forever, because the second set would never be scheduled and so would never reach it.

The processor is basically free to schedule the threads in any way it wants, but for the reason seen in the example, it generally interleaves them in some fashion.