Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Previous | Next --- Slide 31 of 59

trappedin418

To clarify: the work-efficient scan is O(n) but the tree-like iterative structure of the algorithm cannot efficiently utilize SIMD because the iterations work on disjointed sections of the array with different sizes at each iterations.

This less work efficient warp_scan takes more advantage of the SIMD instructions by keeping as many lanes busy at times, even if technically some of the work is being "duplicated".

Lesson to learn: intuitions about asymptotic work might not translate well into real hardware.

pkoenig10

I think this was very clearly displayed in Assignment 2. More complex solutions that attempted to minimize work done using parallelization often performed poorly compared to solutions that did more work, but were a better fit for the CUDA programming model. This reminds me of slide 21 but in reverse; the more complicated solution maybe have spent less time doing relevant computations, but the overhead resulted in an overall slower solution.

bojianh

For SIMD units, the maximum calculation output is #SIMD units * span. So the span of the program really mattered here.