Slide View : Parallel Computer Architecture and Programming : 15-418/618 Fall 2016

Previous | Next --- Slide 32 of 63

taoy1

This design (scan_warp and scan_block) is great! The problem is decomposed so that only warps do sequential scan. There is no synchronization needed inside a warp (as they execute in lock step). Also inside a block shared memory can be used and __syncthreads (block-level synchronization barrier) is not expensive.

Maybe other CUDA programs can be implemented in this way too, using a warp-level parallelism and a block-level parallelism.