Previous | Next --- Slide 17 of 52
Back to Lecture Thumbnails
kayvonf

This would be a good slide for someone to explain. How does this program work?

ycp

I guess the easier way to understand the code is to look at the ISPC, ISPC was created because the creator was frustrated with the verbose C+AVX intrinsics code after all. Basically, each member of the "gang" has its own value "partial" where it will store its partial sum. For however the load is split amongst the gang (probably is such a way that each gang member has a continuous block of memory to read from), each gang member adds the value at x[i] to its own value of partial. Eventually all the values in the array x will have been added into one of the partial sums. Afterwards, using "reduceAdd", all the partial values from each gang member is reduced into a single sum.

ycp

I guess I'll also take a crack at the AVX intrinsics too. So it looks like partial is some vector of 256 bits value, or the equivalent of the size of 8 floats. So in the loop, with _mm256_load_ps(&x[i]), the next 8 values of the array x that have not been looked thus far are loaded into a new vector, the same length as the vector partial. Then, using _mm256_add_ps, the 8 values of partial are updated by adding the corresponding value from the new vector that was just loaded. After the loop, all the values in the array x will have been added into one of the 8 values in the vector partial. Then at the end, each of the 8 values of the vector partial are summed into one cumulative sum.

analysiser

I am so confused about the

for (int i = 0; i < 8; i++)
    sum += partial[i];

part. I think the partial size is 8 because of it is declared as __mm256?

But in that case, _mm256_add_ps seems execute like:

partial[0] = x[0] + x[8] + x[16] + ...
partial[1] = x[1] + x[9] + x[17] + ...
...
partial[7] = x[7] + x[15] + x[23] + ...

But this seems like weird to me...

kayvonf

partial is a vector of eight elements. The i=0;i<8;i++ loop adds up all the elements in the vector.

The question is, what are the values of these eight elements? And given you're post, it looks like you've got it right.

analysiser

So as I see the description of _mm256_add_ps(__m256 m1, __m256 m2) is

Performs a SIMD addition of eight packed single-precision floating-point elements (float32 elements) in the first source vector, m1, with eight float32 elements in the second source vector, m2.

Therefore I think my guess should be right. partial should load from continuous memory and be added to a vector of eight values.

rokhinip

So in the ISPC program, assuming that we have a round-robin distribution of loop iterations to program instances, and assuming that we are working with SIMD vector width 8, we then have that $partial_i$ refers to the local partial variable of program instance $i$, $0 < i < 7$. Then $partial_i = x[i] + x[i + 8] + x[i + 16] + ...$.

Similarly, we see that in the C+ AVX intrinsics, partial is an 8 wide vector and as @analysiser points out, $partial[i] = x[i] + x[i + 8] + ...$. We can then see a parallel in the way the partial sums are computed in both.

The final for loop in the C+ AVX intrinsics which sums up the elements of the partial vector corresponds to the reduceAdd in the ISPC instance which sums up the results of the local partial variable of each program instance.