Question: This slide states that the work-efficient version of scan would require more instructions than this version. Why is this the case?
ericwang
In this context, we have 32-wide SIMD execution for 32 elements. So one instruction can finish one round calculation for all 32 elements. So around lg(N) instructions would be enough for this implementation.
Then in work-efficient version, SIMD execution can still handle maximum 32 elements one time. But up-sweep plus donw-sweep will require 2*lg(N) rounds. Actually this version's implementation never uses more than 50% SIMD capability. The utilization is about half of the one in this slide.
As a result, more than 2x instructions would be needed.
Question: This slide states that the work-efficient version of scan would require more instructions than this version. Why is this the case?
In this context, we have 32-wide SIMD execution for 32 elements. So one instruction can finish one round calculation for all 32 elements. So around lg(N) instructions would be enough for this implementation.
Then in work-efficient version, SIMD execution can still handle maximum 32 elements one time. But up-sweep plus donw-sweep will require 2*lg(N) rounds. Actually this version's implementation never uses more than 50% SIMD capability. The utilization is about half of the one in this slide.
As a result, more than 2x instructions would be needed.