As it turns out, the explicit intrinsics are actually unnecessary in this case: for the sinx program on the previous slide, gcc -O3 detects the data parallelism and emits vector instructions:
This only works on simple programs and gcc -O2 does not generate vector instructions for sqrt.
This comment was marked helpful 0 times.
akashr
So what happens when this code actually runs on a cpu that does not support vector instructions? Does it just run each instruction that many times to get the computation done?
This comment was marked helpful 0 times.
kayvonf
@akashr: The processor throws an invalid opcode exception and your program is terminated. This is easy to see for yourself: Compile the ISPC programs from Assignment 1 using the --target=avx-x2 flag (you must edit the Makefile) and try and run the resulting binary on the 5201/5205 machines that do not support AVX instructions.
This comment was marked helpful 0 times.
bourne
Is it possible to have a compiler or OS figure out how large to make the vector based on the number of ALUs instead of hard coding 8?
This comment was marked helpful 0 times.
sfackler
@bourne: GCC can (sort of) do it too, as @alex mentioned. Something like ISPC can do a better job. It's not really the OS's job to rewrite a program being executed at runtime.
This comment was marked helpful 0 times.
briandecost
@bourne: to add to what @sfackler said, the way I see it, you're not hard-coding the 8 so much as you're hard-coding the entire function, because the compiler isn't yet good enough to vectorize your code for you, as @alex mentions regarding sqrt. I understand vector intrinsics as a sort of happy medium between regular C code and handcoded assembly. It's a sacrifice of portability for performance
As it turns out, the explicit intrinsics are actually unnecessary in this case: for the
sinx
program on the previous slide,gcc -O3
detects the data parallelism and emits vector instructions:This only works on simple programs and
gcc -O2
does not generate vector instructions forsqrt
.This comment was marked helpful 0 times.
So what happens when this code actually runs on a cpu that does not support vector instructions? Does it just run each instruction that many times to get the computation done?
This comment was marked helpful 0 times.
@akashr: The processor throws an invalid opcode exception and your program is terminated. This is easy to see for yourself: Compile the ISPC programs from Assignment 1 using the
--target=avx-x2
flag (you must edit the Makefile) and try and run the resulting binary on the 5201/5205 machines that do not support AVX instructions.This comment was marked helpful 0 times.
Is it possible to have a compiler or OS figure out how large to make the vector based on the number of ALUs instead of hard coding 8?
This comment was marked helpful 0 times.
@bourne: GCC can (sort of) do it too, as @alex mentioned. Something like ISPC can do a better job. It's not really the OS's job to rewrite a program being executed at runtime.
This comment was marked helpful 0 times.
@bourne: to add to what @sfackler said, the way I see it, you're not hard-coding the 8 so much as you're hard-coding the entire function, because the compiler isn't yet good enough to vectorize your code for you, as @alex mentions regarding sqrt. I understand vector intrinsics as a sort of happy medium between regular C code and handcoded assembly. It's a sacrifice of portability for performance
This comment was marked helpful 0 times.