Work = 32 threads performing 5 instructions each = 5 * 32 = 160 = N log N (here N = 32, so log₂ N = 5). Although this implementation is not work efficient, it is a better fit for 32-wide SIMD execution. In other words, we increase the total work in order to reduce the overall span to log N = 5 steps.
firebb
One thing that caught my attention here is that we don't synchronize after each step. Is it because we are using only one SIMD unit, so all threads are automatically synchronized?
axiao
Yeah I think it's because the code assumes that the array is of size 32 and that scan_warp is run by a group of 32 CUDA threads on the same warp, which uses 32-wide SIMD execution. Since SIMD execution proceeds in lockstep on a single instruction stream, no synchronization is needed here.
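To convince myself, here's a quick Python sketch (not the real CUDA code, just a simulation of the lockstep behavior, with a made-up input of 1..32):

```python
# Simulate a 32-wide warp scan (Hillis-Steele style): 5 steps, and in each
# step all 32 lanes execute the same instruction "at once". The snapshot
# models SIMD lockstep: every lane reads before any lane writes, which is
# why no explicit synchronization is needed within the warp.
N = 32
data = list(range(1, N + 1))          # input: 1, 2, ..., 32
for offset in (1, 2, 4, 8, 16):       # log2(32) = 5 steps
    prev = data[:]                    # lockstep: reads see pre-step values
    for lane in range(N):
        if lane >= offset:
            data[lane] = prev[lane - offset] + prev[lane]
# data now holds the inclusive prefix sums of the input
assert data[-1] == sum(range(1, N + 1))  # 528
```

Total work here is 5 steps * 32 lanes, matching the N log N count above.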
Master
Note that this function implements inclusive scan but returns exclusive results.
rrudolph
If you look at the implementation given to us in assignment 2, threads are synchronized between each pass. This is likely needed because that implementation is given as a function to be called from several kernels, and may not be executing in perfect SIMD lockstep.
yes
@Master I see that it returns exclusive results, but what do you mean that it implements inclusive scan? Seems like it's not possible to get the results of an inclusive scan given the exclusive scan result.
Master
@yes You can see it from the diagram, and there is a comment on the slide as well.
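To spell it out: the exclusive result is just the inclusive result shifted right by one lane, with the identity (0) at lane 0. A quick sketch with a made-up input:

```python
# Exclusive scan from inclusive scan is a one-lane shift: exclusive[i] is
# the running sum of everything *before* element i, which is exactly the
# inclusive result at i - 1 (and 0 at i = 0).
inp = [3, 1, 7, 0, 4, 1, 6, 3]        # made-up example input
inclusive = []
total = 0
for x in inp:
    total += x
    inclusive.append(total)           # inclusive[i] = sum(inp[:i+1])
exclusive = [0] + inclusive[:-1]      # shift right, seed with identity 0
print(inclusive)  # [3, 4, 11, 11, 15, 16, 22, 25]
print(exclusive)  # [0, 3, 4, 11, 11, 15, 16, 22]
```

This is presumably why the function can compute the inclusive scan in place and still hand back the exclusive value: each thread just returns its left neighbor's inclusive result.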
yes
I see, I must have misread the two lines and switched the exclusive/inclusive words between them.