Slide 24 of 65
alex

As it turns out, the explicit intrinsics are actually unnecessary in this case: for the sinx program on the previous slide, gcc -O3 detects the data parallelism and emits vector instructions:

0000000000400520 <sinx>:
  ... <snip> ...
  400575: 0f 59 c2              mulps  %xmm2,%xmm0
  400578: 0f 28 e2              movaps %xmm2,%xmm4
  40057b: 0f 28 da              movaps %xmm2,%xmm3
  40057e: 0f 59 e0              mulps  %xmm0,%xmm4
  400581: 0f 57 c5              xorps  %xmm5,%xmm0
  400584: 0f 59 dc              mulps  %xmm4,%xmm3
  400587: 41 0f 5e c2           divps  %xmm10,%xmm0
  ...

This only works for simple programs, though; gcc -O2 does not generate vector instructions for sqrt.
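For anyone who wants to try this, here is a minimal sketch of the kind of loop being discussed. It is a Taylor-series sine written from scratch, not a copy of the code on the previous slide, so the signature and the names (`sinx`, `terms`, `numer`, `denom`) are assumptions. Each iteration of the outer loop is independent of the others, which is the data parallelism the compiler looks for.

```c
/* Sketch only: approximates sin(x[i]) for each of the N inputs using
 * x - x^3/3! + x^5/5! - ...  with `terms` correction terms.
 * Iterations of the outer loop do not depend on each other, so a
 * vectorizing compiler may process several values of i at once. */
void sinx(int N, int terms, const float *x, float *result)
{
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];   /* x^3 */
        float denom = 6.0f;                 /* 3! */
        float sign  = -1.0f;

        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];                 /* next odd power of x */
            denom *= (float)((2 * j + 2) * (2 * j + 3)); /* next odd factorial */
            sign  = -sign;
        }
        result[i] = value;
    }
}
```

Compiling with something like `gcc -O3 -c sinx.c` and then running `objdump -d sinx.o` lets you look for packed instructions such as `mulps`. Whether they show up depends on the gcc version and flags, so treat the listing above as one observed result rather than a guarantee.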

akashr

So what happens when this code actually runs on a CPU that does not support vector instructions? Does it just run each instruction that many times to get the computation done?

kayvonf

@akashr: The processor throws an invalid opcode exception and your program is terminated. This is easy to see for yourself: compile the ISPC programs from Assignment 1 using the --target=avx-x2 flag (you must edit the Makefile) and try to run the resulting binary on the 5201/5205 machines, which do not support AVX instructions.
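One common way to avoid that crash is to ask the CPU what it supports before jumping into AVX code. The sketch below uses the `__builtin_cpu_init` and `__builtin_cpu_supports` built-ins that GCC and Clang provide on x86. The function names `cpu_has_avx` and `pick_kernel` are hypothetical, made up for this illustration, and on other compilers or architectures the check simply reports "no AVX".

```c
/* Returns 1 if the CPU running this code reports AVX support, 0 otherwise.
 * Assumes GCC or Clang on x86; anywhere else it conservatively returns 0. */
int cpu_has_avx(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                          /* populate the CPU feature data */
    return __builtin_cpu_supports("avx") ? 1 : 0;  /* builtin returns nonzero if supported */
#else
    return 0;
#endif
}

/* Hypothetical dispatcher: names which code path a program would take.
 * A real program would return a function pointer to the matching kernel. */
const char *pick_kernel(void)
{
    return cpu_has_avx() ? "avx" : "scalar";
}
```

A binary built this way would still need the AVX kernel compiled separately from the rest of the program, so that no AVX instruction executes before the check runs.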

bourne

Is it possible to have a compiler or OS figure out how large to make the vector based on the number of ALUs instead of hard-coding 8?

sfackler

@bourne: GCC can (sort of) do it too, as @alex mentioned. Something like ISPC can do a better job. It's not really the OS's job to rewrite a program being executed at runtime.

briandecost

@bourne: to add to what @sfackler said, the way I see it, you're not hard-coding the 8 so much as you're hard-coding the entire function, because the compiler isn't yet good enough to vectorize your code for you, as @alex mentions regarding sqrt. I understand vector intrinsics as a sort of happy medium between regular C code and hand-coded assembly. It's a sacrifice of portability for performance.
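To make the trade-off concrete, here is a small sketch that writes the same operation twice: once as plain C and once with SSE intrinsics. The function names (`scale_scalar`, `scale_sse`) are made up for illustration. It uses 4-wide SSE rather than the 8-wide AVX discussed above only because SSE compiles on x86-64 without extra flags. On targets without SSE, the second function falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ... */
#endif

/* Portable version: the compiler may pick any vector width, or none. */
void scale_scalar(size_t n, const float *x, float s, float *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] * s;
}

/* Intrinsics version: the width (4 floats) and the instruction set are
 * written into the source, so the whole function is tied to SSE. */
void scale_sse(size_t n, const float *x, float s, float *out)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vs = _mm_set1_ps(s);          /* broadcast s to all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);       /* unaligned load of 4 floats */
        _mm_storeu_ps(out + i, _mm_mul_ps(vx, vs));
    }
#endif
    for (; i < n; i++)                         /* leftover elements (or everything, without SSE) */
        out[i] = x[i] * s;
}
```

Both functions produce the same results. The difference is that moving the second one to a wider vector unit means rewriting it, while the first only needs recompiling.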