Work - Around 3, since there are 2 adds and 1 copy.
Span - 2, since there are upsweep and downsweep (lgN+lgN)
haoala
While the work-efficient scan algorithm has better work (asymptotically) than the one two slides ago (O(N) < O(N lg N)), its constant for span is larger by a factor of approximately two, since the upsweep and downsweep stages each have similar span to that of the previous algorithm.
Metalbird
I think the locality is fairly good for this algorithm. At least in the first step of the up-sweep and the last step of the down-sweep, values are being added to values near them in storage. During the middle iterations, there may be some locality issues if we have to bring a whole new line into the cache for every value that we would need, but I may be mistaken on this point.
cluo1
This algorithm has less work to do but a larger span: O(2* lgN). If the array is small enough so that all the elements can be concurrently executed on GPU, the run time will be dominated by span. As the result, this algorithm will run slower.
sadkins
The recursive implementation of scan has inefficient work compared to the looped implementation, though it has logarithmic span. Span calculations assume an infinite number of processors(which isn't practical) and don't consider overhead communication costs. Therefore good span does not necessarily mean the algorithm will perform well in practice
Constants:
While the work-efficient scan algorithm has better work (asymptotically) than the one two slides ago (O(N) < O(N lg N)), its constant for span is larger by a factor of approximately two, since the upsweep and downsweep stages each have similar span to that of the previous algorithm.
I think the locality is fairly good for this algorithm. At least in the first step of the up-sweep and the last step of the down-sweep, values are being added to values near them in storage. During the middle iterations, there may be some locality issues if we have to bring a whole new line into the cache for every value that we would need, but I may be mistaken on this point.
This algorithm has less work to do but a larger span: O(2* lgN). If the array is small enough so that all the elements can be concurrently executed on GPU, the run time will be dominated by span. As the result, this algorithm will run slower.
The recursive implementation of scan has inefficient work compared to the looped implementation, though it has logarithmic span. Span calculations assume an infinite number of processors(which isn't practical) and don't consider overhead communication costs. Therefore good span does not necessarily mean the algorithm will perform well in practice