Question: I have never tried AVX yet, so while reading I found the line __mm256_broadcast_ss(6) somewhat strange... According to Intel's documentation it "Loads and broadcasts 256/128-bit scalar single-precision floating point values to a 256/128-bit destination operand." But what exactly does broadcast mean?
This comment was marked helpful 0 times.
kayvonf
In vector programming, a broadcast (or a "splat") takes a single value, say x, and writes it to all lanes of the vector unit. In Assignment 1, the 15418 fake vector intrinsic _cmu418_vset_float is a form of a broadcast. You may find the following useful: http://software.intel.com/sites/landingpage/IntrinsicsGuide
Here is the definition of _mm256_broadcast_ss: "Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst."
Operation pseudocode (consider dst to be a vector register):
tmp[31:0] = MEM[mem_addr+31:mem_addr]
FOR j := 0 to 7
i := j*32
dst[i+31:i] := tmp[31:0]
ENDFOR
dst[MAX:256] := 0
This comment was marked helpful 0 times.
ycp
So is this kind of code specific to the machine you will be running it on and the processor it has? In other words, would vector programming translate across machines?
This comment was marked helpful 1 time.
kkz
I'm interested to know this (above) as well. I would guess that the SIMD implementation cannot be done through the OS interface?
Edit: Just curious, so we use __mm256_broadcast_ss(6) to write eight copies of 6 across the vector. How would we go about writing something like (1,2,3,4,5,6,7,8)? Must we load the values into an array and call _mm256_load_ps()?
This comment was marked helpful 0 times.
kayvonf
@ycp: Great observation. A binary with vector instructions in it only runs on a machine that supports those instructions. Since the binary is emitted assuming a particular width it is not portable to another machine with vector instructions of a different sort. For example, a program compiled using SSE4 in 2010 won't be able to take advantage of AVX instructions in a later chip. Current programs using AVX can't take advantage of the 16-wide AVX-512 instructions available on the 60-core Intel Xeon Phi processors.
This is one reason that GPUs have adopted an "implicit SIMD" implementation strategy, where the instructions themselves are scalar but are replicated across N ALUs under the hood by the hardware. See Lecture 2, slide. This theoretically allows GPUs to change their SIMD width from generation to generation without requiring changes to binaries. There are positives and negatives to both strategies.
This comment was marked helpful 0 times.
kayvonf
Question: someone should go figure out the answer to @kkz's question.
This comment was marked helpful 0 times.
mchoquet
@kkz: for the vector question, since these are floats you would use _mm256_set_ps(8,7,6,5,4,3,2,1), which takes its arguments from the highest lane down (_mm256_setr_ps(1,2,3,4,5,6,7,8) takes them in memory order); _mm256_set_epi32 is the integer variant.
I feel like SIMD is the compiler's responsibility, not the operating system's. The OS just runs assembly; it doesn't generate any. The one case I can think of is virtualization: if you're writing a VM that simulates a machine with vector assembly instructions, you might have to handle those at the OS level.
This comment was marked helpful 2 times.
rokhinip
I actually extended my compiler from 15-411 to emit SIMD instructions, and I'd definitely agree that SIMD is the compiler's responsibility more than the OS's. However, detecting the parallelism across loops and such is a difficult problem, so one might also need constructs in the language, similar to OpenMP, to help the compiler make this optimization.