Why do idx & 31? If every idx is <= 31, the mask has no effect. And if we have, for example, idx=16 and idx=48, both map to 16, so we'd be doing redundant work.
EDIT: Oh, I get it. It makes sense once you consider that threads 0-31 are warp 1, threads 32-63 are warp 2, and so on... AND that only one warp will be doing this on a given ptr at a time.
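To make the arithmetic concrete, here is a minimal host-side sketch of what the mask computes. It assumes a warp size of 32, and the names `kWarpSize`, `lane_of` and `warp_of` are made up for illustration; they are not from the code being discussed. Warps are numbered from 0 here, so "warp 1" above corresponds to warp 0 in the sketch.

```cpp
#include <cstdint>

// Assumption: 32 threads per warp.
constexpr std::uint32_t kWarpSize = 32;

// Lane: the thread's position inside its warp (0..31).
// Because 32 is a power of two, idx & 31 equals idx % 32.
constexpr std::uint32_t lane_of(std::uint32_t idx) {
    return idx & (kWarpSize - 1);
}

// Warp: which group of 32 consecutive threads idx falls into (0-based).
constexpr std::uint32_t warp_of(std::uint32_t idx) {
    return idx / kWarpSize;
}

// idx=16 and idx=48 share lane 16 but sit in different warps.
static_assert(lane_of(16) == 16 && warp_of(16) == 0, "idx 16: lane 16, warp 0");
static_assert(lane_of(48) == 16 && warp_of(48) == 1, "idx 48: lane 16, warp 1");
```

So the two threads only look redundant if you ignore the warp: they have the same lane but belong to different warps.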