Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

A Modern Multi-Core Processor

Slide 28 of 79


yimmyz

The code above can be summarized by the following pseudocode:

> for each 8-element chunk in x:

> load 8-element chunk into 256-bit variable origx

> instantiate 8-element chunk to store results

> calculate numerators and denominators of the 8-element chunk

> save results into the original array

Specifically, _mm256_mul_ps performs vectorized (element-wise) multiplication, _mm256_div_ps performs vectorized division, and _mm256_add_ps performs vectorized addition, each operating on eight single-precision floats at once.
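To make the pseudocode concrete, here is a minimal sketch in C that follows the same steps. It is not the code from the slide: the function name add_cubic_term and the formula out[i] = x[i] + x[i]^3 / 6 are assumptions chosen only to exercise the three intrinsics. It also assumes GCC or Clang (for the target attribute, which avoids needing the -mavx flag) and uses the unaligned load and store variants so that it works with any float pointer.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical example, not the slide's code:
 * computes out[i] = x[i] + x[i]^3 / 6 for n floats, 8 at a time.
 * Assumes n is a multiple of 8.
 * The target attribute is GCC/Clang-specific; with other compilers,
 * drop it and enable AVX through the compiler's own flag. */
__attribute__((target("avx")))
void add_cubic_term(int n, const float *x, float *out)
{
    assert(n % 8 == 0);

    /* The constant 6.0f copied into all 8 lanes. */
    __m256 six = _mm256_set1_ps(6.0f);

    for (int i = 0; i < n; i += 8) {
        /* Load an 8-element chunk into a 256-bit variable. */
        __m256 origx = _mm256_loadu_ps(&x[i]);

        /* Numerator: x * x * x, computed lane by lane. */
        __m256 numer = _mm256_mul_ps(origx, _mm256_mul_ps(origx, origx));

        /* Divide by the denominator, then add the original value. */
        __m256 term  = _mm256_div_ps(numer, six);
        __m256 value = _mm256_add_ps(origx, term);

        /* Store the 8 results to the output array. */
        _mm256_storeu_ps(&out[i], value);
    }
}
```

Each intrinsic call here replaces eight scalar operations, so the loop body runs n / 8 times instead of n times.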

365sleeping

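Note that the aligned-load AVX intrinsics, like _mm256_load_ps(&x[i]), assume the address, i.e. x + i, is 32-byte aligned; otherwise the load typically faults, which usually shows up as a segmentation fault. Similarly, for the SSE intrinsics, like _mm_load_ps, the address must be 16-byte aligned. The unaligned variants, _mm256_loadu_ps and _mm_loadu_ps, have no such requirement.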
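One way to guard against the alignment problem is to check the address before choosing which load to issue. The sketch below is illustrative only: is_aligned and sum8 are hypothetical helper names, and the target attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Hypothetical helper: sums the 8 floats starting at p.
 * Uses the aligned load only when p is 32-byte aligned and falls back
 * to the unaligned load otherwise, so it never faults on alignment. */
__attribute__((target("avx")))
float sum8(const float *p)
{
    __m256 v = is_aligned(p, 32) ? _mm256_load_ps(p)
                                 : _mm256_loadu_ps(p);

    /* Spill the 8 lanes to a small array and add them up in scalar code. */
    float lanes[8];
    _mm256_storeu_ps(lanes, v);

    float s = 0.0f;
    for (int k = 0; k < 8; k++) {
        s += lanes[k];
    }
    return s;
}
```

A buffer that satisfies the 32-byte requirement can be obtained with C11 aligned_alloc(32, size), where size is a multiple of 32, or with _mm_malloc(size, 32) paired with _mm_free. Note that x + i stays 32-byte aligned only when x is aligned and i is a multiple of 8 floats.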