Previous | Next --- Slide 32 of 63
Back to Lecture Thumbnails
taoy1

This design (scan_warp and scan_block) is great! The problem is decomposed so that only warps do sequential scan. There is no synchronization needed inside a warp (as they execute in lock step). Also inside a block shared memory can be used and __syncthreads (block-level synchronization barrier) is not expensive.

Maybe other CUDA programs can be implemented in this way too, using a warp-level parallelism and a block-level parallelism.