Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

yimmyz

I will again try to explain the function "sumall2" above with pseudocode:

> initialize partial sum vector "partial" to zeros

> for every eight elements in array $x$, load into a temp vector, and perform vector-add with partial vector

> copy the partial vector out into array form

> calculate the sum of partials as total sum

Funky9000

I think there is a difference between the 2 code. In the ISPC version, partial sums consists of arbitrary indexed values (foreach) choosen by the compiler (I think) while in the AVX version, partial sums are blocks of contiguous values.

lol

@Funky9000 Yeah, that's correct. As the distribution of work in the foreach is implementation defined, the partial sums in the second version can be sums of arbitrary subsets of indices. In the first, the gang operates on contiguous values, so the partial sums are of equally spaced indices.

It turns out that the current implementation of ISPC interleaves work among the gang as shown before, so the two are the same.

365sleeping

Will the ISPC code on top right be mapped to the AVX intrinsics on bottom left? I believed it would not until I found the following code:

 /* good ispc implementation of a sum reduction */
 uniform float sum(const uniform float array[], uniform int count) {
   float sum = 0;
   foreach (i = 0 ... count)
      sum += array[i+programIndex];
   return reduce_add(sum);
 }

Can someone explain to me why this code would be equivalent to the code below? Thanks!

 /* C implementation of a sum reduction */
 float sum(const float array[], int count) {
   float sum = 0;
   for (int i = 0; i < count; ++i)
     sum += array[i];
   return sum;
 }

PS: I am sorry I don't know how to control the format of the code. I'd appreciate if someone help me out.

jhibshma

I'm guessing that like the foreach loop, the line sum = reduce_add(partial); is executed once by teamwork rather than once in each gangmember.

Since the adding can be done in a binary-tree like style, does reduce_add also use SIMD instructions?

xingdaz

I don't know what the rest of you thinks, but the AVX implementation makes more sense to me than the ISPC version; you know exactly what is happening on the hardware level.

patrickbot

I think that's where we see a difference between an abstraction and implementation. The AVX code is more like the implementation -- you know exactly what is going on at each index. The ISPC code is at a higher level; the foreach keyword is an abstraction. As a reader/writer of the code, you're trusting the compiler to correctly split up the indices in the best way. However, you don't actually know how this is done, and as it has been pointed out by lol, it depends on the implementation of foreach. As a result, the ISPC code might make less sense because there's more magic and it is less explicit about what it is doing, but that's because there is a lot of meaning behind the foreach keyword.

c0d3r

I'm wondering why the _mm256_store_ps(tmp, partial) was necessary. Why couldn't we keep using partial instead of copying its contents to tmp?

yimmyz

@c0d3r Ahh, check the type of "partial" -- it is __mm256, i.e. a "blob" / vector of 256 bytes. Standard C(++) does not support this data type, and we have to copy it out into standard data structure first.

mallocanswer

@365sleeping Hi, could you talk a little bit more about the code of good implementation of a sum reduction. Suppose we have two program instances. Instance one's programIndex is 0, so it will sum the array from 0 to count. Similarly, Instance two will sum the array from 1 to count + 1. So finally, we will get $$\sum_{i=0}^{count} array[i] + \sum_{i=1}^{count+1} array[i]$$? Please correct me if I am wrong

365sleeping

@mallocanswer Yes, that is where I am confused as the results would be different. The code is found at https://ispc.github.io/perfguide.html. It is also likely that I misunderstood the scenario.

yangwu

@mallocanswer I think foreach is an abstraction here. foreach (i = 0 ... count) compiled to be a iteration with step of programCount rather than 1. So if we have 2 program instances, one would add all odd indices and the other sum up even ones.

And just to clarify: are we choosing an array of size 8 at bottom-left because we are using a machine with 8 ALUs?

mallocanswer

@yangwu Hi, I did some experiments but the result differed from what I expected. Could you help me figure out what is going on here?

ispc code:

 export uniform float sum(const uniform float array[], uniform int count) {
    float sum = 0.f;
    foreach(i = 0 ... count) {
            sum += array[i+programIndex];
    }
    return reduce_add(sum);
 }

main function:

 float sumarray[8] = {1.0, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0};
 float result = sum(sumarray, 8);
 printf("result:%f\n", result);

result is 4.0 (supposed to be 8.0)

yangwu

@mallocanswer I tested with sum += array[i] the program gives 8.0.

so my guess is ispc will do the 'jump' for you, by adding programIndex actually skip half of array. and as for the sample code, probably they compiled with different flags..