Slide 29 of 63
bob_sacamano

Why are there no synchronization issues in this code?

For instance, we have 32 threads executing the scan_warp function in parallel. Considering the "if (lane >= 1) ptr[idx] = ptr[idx - 1] + ptr[idx]" statement,

Thread ID 1 would perform ptr[1] = ptr[0] + ptr[1] (stores into ptr[1])

Thread ID 2 would perform ptr[2] = ptr[1] + ptr[2] (reads from ptr[1])

Wouldn't this be considered a race on ptr[1]?

bpr

@bob_sacamano, I know you mentioned it, but start with the phrase "When scan_warp is run by a group of 32 CUDA threads". The number 32 is significant because it is the warp size. If this code is run by a single warp of 32 threads, then the threads in that warp have to execute in lock step with each other, so every thread reads ptr[idx - 1] and ptr[idx] before any thread stores into ptr[idx]. But in general your intuition is correct: (CUDA) threads that are not guaranteed to run in lock step would race when executing this code.
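To make the difference concrete, here is a minimal CPU-side sketch in C++ (not CUDA, and not the code from the slide) that models the single statement "if (lane >= 1) ptr[idx] = ptr[idx - 1] + ptr[idx]" under two orderings. The names step_lockstep, step_serialized, Warp and kWarpSize are made up for illustration, and the "serialized" ordering is just one of many interleavings that unsynchronized threads could produce.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;      // assumed warp size, as in the slide
using Warp = std::array<int, kWarpSize>;   // hypothetical stand-in for ptr[0..31]

// Lock-step model: every lane performs its loads and its add before any
// lane performs its store, so each lane sees the old value of ptr[lane - 1].
Warp step_lockstep(Warp ptr) {
    Warp sum = ptr;                         // lane 0 is inactive and keeps its value
    for (std::size_t lane = 1; lane < kWarpSize; ++lane)
        sum[lane] = ptr[lane - 1] + ptr[lane];   // reads only the old values
    return sum;                             // all stores become visible together
}

// One possible unsynchronized interleaving: lane 1 runs the whole statement,
// then lane 2, and so on. Lane 2 now reads the value lane 1 just stored.
Warp step_serialized(Warp ptr) {
    for (std::size_t lane = 1; lane < kWarpSize; ++lane)
        ptr[lane] = ptr[lane - 1] + ptr[lane];
    return ptr;
}
```

With an input of all ones, step_lockstep leaves ptr[2] equal to 2 (old ptr[1] plus old ptr[2]), while step_serialized leaves ptr[2] equal to 3, because it picks up the value that lane 1 just wrote. That second outcome is the race on ptr[1] described in the question.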

acortes

@bpr What do you mean the threads have to execute in lock step with each other?

bpr

@acortes, CUDA threads in a warp each execute the same instruction at the same time. In #54, the warp selector sends the selected instruction(s) to all of the functional units, which all execute that instruction in the same cycle(s).
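As a rough illustration of "same instruction at the same time", the C++ sketch below simulates a whole warp-wide scan on the CPU by issuing one "instruction" to all 32 lanes before moving on to the next. It assumes that scan_warp repeats the statement quoted above with offsets 1, 2, 4, 8 and 16 and that the operator is addition; only the offset 1 statement appears in this thread, so the rest is an assumption. The names scan_warp_sim, WarpData and kLanes are hypothetical.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 32;            // assumed warp size
using WarpData = std::array<int, kLanes>;     // hypothetical stand-in for ptr[0..31]

// Simulates a warp executing in lock step: for each step, every active lane
// executes the load+add "instruction", and only then does every active lane
// execute the store "instruction".
WarpData scan_warp_sim(WarpData ptr) {
    for (std::size_t offset = 1; offset < kLanes; offset *= 2) {
        WarpData reg{};                       // per-lane register for the sum

        // Instruction 1, issued to all lanes: load both operands and add.
        for (std::size_t lane = offset; lane < kLanes; ++lane)
            reg[lane] = ptr[lane - offset] + ptr[lane];

        // Instruction 2, issued to all lanes: store the result.
        for (std::size_t lane = offset; lane < kLanes; ++lane)
            ptr[lane] = reg[lane];
    }
    return ptr;                               // inclusive prefix sum of the input
}
```

Because no lane stores until all lanes have finished loading, each step reads consistent values and the result is an inclusive prefix sum. On real hardware this guarantee depends on the architecture: NVIDIA GPUs from Volta onward can schedule the threads of a warp independently, so code that relies on implicit lock step generally needs an explicit __syncwarp() between the read and the write.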