What the code above does can be described with the pseudocode:

> for 8-element chunks in x

> load 8-element chunk into 256-bit variable origx

> instantiate 8-element chunk to store results

> calculate numerators and denominators of the 8-element chunk

> save results into the original array

Specifically, __mm256_mul_ps means vectorized multiplication, and __mm256_div_ps means vectorized division, and __mm256_add_ps means vectorized summation.

365sleeping

Note that AVX intrinsics, like _mm256_load_ps(&x[i]), assume the address, i.e. x + i, is 32-byte-aligned; otherwise, there will be a segment fault.
Similarly, for SSE intrinsics, like _mm_load_ps, the address should be 16 bytes aligned.

What the code above does can be described with the pseudocode:

`> for 8-element chunks in x`

`> load 8-element chunk into 256-bit variable origx`

`> instantiate 8-element chunk to store results`

`> calculate numerators and denominators of the 8-element chunk`

`> save results into the original array`

Specifically,

`__mm256_mul_ps`

means vectorized multiplication, and`__mm256_div_ps`

means vectorized division, and`__mm256_add_ps`

means vectorized summation.Note that AVX intrinsics, like

`_mm256_load_ps(&x[i])`

, assume the address, i.e.`x + i`

, is 32-byte-aligned; otherwise, there will be a segment fault. Similarly, for SSE intrinsics, like`_mm_load_ps`

, the address should be 16 bytes aligned.