lament

For my future reference (because even now I question my recollection from lecture), could someone post an explanation of the intrinsics shown on this slide?

While it is always good to know these things, are we expected to recall them later?

arjunh

@lament Most of these are relatively self-explanatory:

  • _mm256_load_ps loads a vector of data (eight contiguous floats) from memory
  • _mm256_div_ps, _mm256_add_ps, and _mm256_mul_ps perform elementwise arithmetic on vectors (division, addition, multiplication)
  • _mm256_set1_ps creates a vector with all elements set to the same constant
  • _mm256_broadcast_ss loads a single float from memory and broadcasts it into all vector elements.

The 'ps' means 'single precision'. The 256 refers to the number of bits in the vector register: 32 bits/float * 8 floats/vector, i.e., 8-wide vectors, as expected from AVX.
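For concreteness, here's a minimal sketch of how these intrinsics combine in a vectorized loop. The function name and the particular computation are made up for illustration; it assumes N is a multiple of 8 and that both arrays are 32-byte aligned:

    #include <immintrin.h>

    // Hypothetical example: result[i] = (x[i] + 1.0f) / 2.0f for N floats.
    // Assumes N is a multiple of 8 and both arrays are 32-byte aligned.
    void scale_shift(float* result, const float* x, int N) {
        __m256 one = _mm256_set1_ps(1.0f);   // vector of eight 1.0f constants
        __m256 two = _mm256_set1_ps(2.0f);
        for (int i = 0; i < N; i += 8) {
            __m256 v = _mm256_load_ps(x + i);   // load 8 contiguous floats
            v = _mm256_add_ps(v, one);          // elementwise add
            v = _mm256_div_ps(v, two);          // elementwise divide
            _mm256_store_ps(result + i, v);     // store 8 floats back
        }
    }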

That's the gist of what's going on, but the good news is that you're not expected to know the AVX/SSE intrinsics at all! You are expected to understand how vectorized code works, though, and you will need to (roughly) know how to transform a program into a vectorized one (you'll be doing that on Assignment 1).

lament

Great! Thank you.

kayvonf

Quick correction on some of the comments above: 'ps' stands for "packed single-precision" (the "packed" part is what signals that the instruction operates on a vector of values rather than a single scalar).

_mm256_load_ps: a vector load of 256 bits (eight fp32 elements) where the elements to be loaded are all packed contiguously in memory. It is also possible to collect fp32 values from non-contiguous memory addresses into a single vector register. This more sophisticated, and certainly more expensive, type of vector register load operation is called a "gather". See the definition of _mm256_i32gather_ps in the Intel Intrinsics Guide.
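A small sketch of what a gather looks like (note that gather requires AVX2; the index pattern here is arbitrary, just pulling every other float):

    #include <immintrin.h>

    // Gather every other float from non-contiguous addresses into one
    // vector register. Requires AVX2; index values are for illustration.
    __m256 gather_strided(const float* data) {
        __m256i idx = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
        return _mm256_i32gather_ps(data, idx, 4);  // scale = 4 bytes per float
    }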

One additional note: _mm256_load_ps requires a 32-byte-aligned address (256 bits = 32 bytes). It is an invalid operation otherwise (your code will core dump). If you cannot prove statically that the memory address holding the vector will be appropriately aligned, you will instead need to use the "unaligned memory" version _mm256_loadu_ps, which may cost performance.
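A quick sketch of the difference, using C11's alignas to guarantee alignment:

    #include <immintrin.h>
    #include <stdalign.h>

    void load_example(void) {
        alignas(32) float buf[9] = {0};       // starts on a 32-byte boundary
        __m256 a = _mm256_load_ps(buf);       // OK: 32-byte-aligned address
        __m256 b = _mm256_loadu_ps(buf + 1);  // misaligned, so loadu is required
        (void)a; (void)b;
    }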

Those of you that want to learn more might enjoy browsing the Intel Intrinsics Guide. It's quite well done.

russt17

MATLAB is particularly useful because it can do pointwise operations between matrices very quickly. Is MATLAB just calling these C intrinsics behind the scenes and getting its speed boost from modern SIMD architecture?

If so, could you write a library in Java to get the same speed benefit as MATLAB on point-wise matrix operations?

hohoho

Can we understand vectorized code similarly to how we understand sequences in SML? That is, we perform a batch operation on every element of the vector, like a "map" over a sequence in SML?

cacay

I feel like most of these functions could be renamed to the much more pleasing +, -, *, etc. using type information. Also, it should be easy to auto-vectorize functional code written using map, zipWith, etc.
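Something like this sketch, for instance (the wrapper type and its name are made up):

    #include <immintrin.h>

    // A thin C++ wrapper so vector arithmetic reads as ordinary operators.
    struct f32x8 { __m256 v; };

    inline f32x8 operator+(f32x8 a, f32x8 b) { return { _mm256_add_ps(a.v, b.v) }; }
    inline f32x8 operator*(f32x8 a, f32x8 b) { return { _mm256_mul_ps(a.v, b.v) }; }
    inline f32x8 operator/(f32x8 a, f32x8 b) { return { _mm256_div_ps(a.v, b.v) }; }

These wrappers should compile to exactly the same instructions as the raw intrinsics.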

TA-lixf

@russt17: 1. I don't know how MATLAB is implemented underneath, but I would guess that it makes use of those vector instructions. Note that much of the early push to speed up programs came from scientific computing, and MATLAB is squarely in that domain. 2. You could write such a library in Java, but is it necessary all the time? Maybe when you are using Java for matrix manipulation, but when you are running a web server? No. In such cases, using vectorized instructions might even cost you more power or even more computation time.

TA-lixf

@hohoho & @cacay: well, they (vector instructions and maps in SML) are similar but not exactly the same, because they live at different levels of abstraction. For one thing, a function like map + seqA can easily be implemented in vector code, but how about map ComplicatedFunction seqB? To implement that, the compiler needs to break the function down, decide which small parts of it can be parallelized, and then choose the correct instructions to issue to the processing units. So yes, they are similar, but they differ by nature: one is a programming-language construct while the other is essentially part of an ISA.
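To make the analogy concrete, here is a sketch of a simple map body in both forms (function names are made up; the vector version assumes n is a multiple of 8):

    #include <immintrin.h>

    // Scalar version: what 'map (fn x => x + 1.0) seq' describes.
    void map_add1_scalar(float* out, const float* in, int n) {
        for (int i = 0; i < n; i++)
            out[i] = in[i] + 1.0f;
    }

    // Vectorized version: the same map, 8 elements per instruction.
    void map_add1_avx(float* out, const float* in, int n) {
        __m256 one = _mm256_set1_ps(1.0f);
        for (int i = 0; i < n; i += 8)
            _mm256_storeu_ps(out + i,
                             _mm256_add_ps(_mm256_loadu_ps(in + i), one));
    }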

doodooloo

@hohoho, I also see the vector as being like a sequence in SML, but I am not 100 percent sure. One thing that is definitely different, though: a vector is much stricter about the size of its elements and its length (here, exactly eight 32-bit floats), whereas a sequence can have arbitrary length.

vrkrishn

@russt17

When you install MATLAB, it comes with native libraries that are optimized for vectorization on your system. When you use a built-in function in MATLAB, it actually executes native code that uses SIMD vectorization. This is why you can write a highly optimized MATLAB function and still have it be slower than a built-in.

abist

I'm confused as to what a "vector" is here. Is it actually just an array-like data structure with a given starting pointer?

kayvonf

In this program, "vector" refers to the __m256 datatype. A variable of this type corresponds to the contents of a 256-bit register that is interpreted as a length-8 vector of single-precision floating-point numbers.
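One way to see this concretely: store the register's contents to memory and print the individual lanes. A small sketch:

    #include <immintrin.h>
    #include <stdio.h>

    void print_lanes(void) {
        __m256 v = _mm256_set1_ps(3.14f);  // all eight lanes hold 3.14
        float out[8];
        _mm256_storeu_ps(out, v);          // spill the register to memory
        for (int i = 0; i < 8; i++)
            printf("lane %d = %f\n", i, out[i]);
    }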

yuel1

It's worth noting that sequences in SML are a datatype, whereas the vectors we are referring to correspond to actual hardware registers, as Prof. Kayvon mentioned above.

A question that I have is: do the AVX intrinsics force the CPU to execute the instructions using SIMD instructions? Do issues arise when you try to compile this program on a CPU that has a different number of ALUs and a different vector width?

kayvonf

Yes, there is a big difference between a sequence (of arbitrary length) and a statically-known fixed-length vector as described here. The "sequences" or collections you've dealt with in 15-210 are more akin to STL vectors in C++ or the arrays x and result in the code above.

Kapteyn

Why doesn't AVX support 256-bit integer operations?

fgomezfr

@yuel "Do the AVX intrinsics force the CPU to execute the instructions using SIMD instructions?"

Yes - Intel intrinsics are just C/C++ syntactic wrappers around assembly instructions. The datatypes they provide allow you to manipulate SIMD vector-width values outside of registers, but the compiler has to emit the proper instructions to do this: it just knows how and when to move things into and out of registers. There is no alternative scalar implementation that can run outside of SIMD, and the CPU certainly doesn't get to choose dynamically. Each intrinsic function is translated directly to the corresponding assembly instruction listed in the Intel Intrinsics Guide.
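For example, this function compiles to essentially a single vaddps instruction (the instruction corresponding to each intrinsic is listed in the Intrinsics Guide):

    #include <immintrin.h>

    // A one-to-one wrapper: _mm256_add_ps maps directly to vaddps.
    __m256 add8(__m256 a, __m256 b) {
        return _mm256_add_ps(a, b);  // emits vaddps ymm, ymm, ymm
    }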

"Do issues arise when you try to compile this program on a CPU that has different numbers of ALUs and vector widths?"

You may be able to compile, but you can only run the code if your CPU supports the AVX instruction set targeted by the compiler. Keep in mind these are Intel-specific extensions to the x86-64 ISA, so every Intel CPU supports some subset of the extensions (SSE, SSE2/3/4, AVX, AVX512, etc.). Intel specifies the hardware that these instructions map to and the number of XMM/YMM registers in a CPU, so the code will run the same on any machine that supports it. There is no concept of 'different vector widths' or a 'different number of ALUs' for a given instruction set; that would be like changing the width of a general-purpose register such as %rax, or shrinking the register file so it only goes up to %r4. The slight exception is that AVX512 requires 512-bit registers, regular AVX requires 256-bit registers, and SSE requires only 128-bit registers; these overlap in newer CPUs, similar to how you can use a 32-bit register as the low bits of a 64-bit register.
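If you need to decide at runtime whether the vectorized path is safe, GCC and Clang provide a builtin for querying CPU features (a sketch; this builtin is compiler-specific, not standard C):

    #include <stdio.h>

    int main(void) {
        if (__builtin_cpu_supports("avx"))
            printf("AVX supported: safe to run the vectorized path\n");
        else
            printf("No AVX: fall back to scalar code\n");
        return 0;
    }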

@Kapteyn "Why doesn't avx support 256 bit integer operations?"

First, note that having a 256-bit register doesn't mean you have the circuits to perform 256-bit operations. Intel assumes you want 32-bit integer/float operations most of the time, and built circuits roughly akin to eight 32-bit scalar processors reading simultaneously from eight 32-bit inputs, except that the inputs come from a single register. This forces the programmer to structure their code for the SIMD lock-step execution model, and allows the same register to be reused for different input layouts.

There is not a lot of need yet for integers larger than 64 bits; a quick search in the Intel Intrinsics Guide reveals no actual math operations on 128-bit integers, just a lot of bitwise operations for manipulating whole registers, plus support for some cryptography operations that work on larger inputs. When/if it becomes useful enough that people will pay for it, I'm sure Intel will add it; we didn't always have AVX operations either :)
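For what it's worth, AVX2 did later add integer SIMD on 256-bit registers, but as independent lanes rather than one huge integer (a sketch):

    #include <immintrin.h>

    // AVX2: eight separate 32-bit integer adds inside one 256-bit register,
    // not a single 256-bit-wide addition.
    __m256i add_lanes(__m256i a, __m256i b) {
        return _mm256_add_epi32(a, b);
    }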