What the code above does can be described with the pseudocode:
> for 8-element chunks in x
> load 8-element chunk into 256-bit variable origx
> instantiate 8-element chunk to store results
> calculate numerators and denominators of the 8-element chunk
> save results into the original array
Specifically, __mm256_mul_ps means vectorized multiplication, and __mm256_div_ps means vectorized division, and __mm256_add_ps means vectorized summation.
365sleeping
Note that AVX intrinsics, like _mm256_load_ps(&x[i]), assume the address, i.e. x + i, is 32-byte-aligned; otherwise, there will be a segment fault.
Similarly, for SSE intrinsics, like _mm_load_ps, the address should be 16 bytes aligned.
What the code above does can be described with the pseudocode:
> for 8-element chunks in x
> load 8-element chunk into 256-bit variable origx
> instantiate 8-element chunk to store results
> calculate numerators and denominators of the 8-element chunk
> save results into the original array
Specifically,
__mm256_mul_ps
means vectorized multiplication, and__mm256_div_ps
means vectorized division, and__mm256_add_ps
means vectorized summation.Note that AVX intrinsics, like
_mm256_load_ps(&x[i])
, assume the address, i.e.x + i
, is 32-byte-aligned; otherwise, there will be a segment fault. Similarly, for SSE intrinsics, like_mm_load_ps
, the address should be 16 bytes aligned.