Slide 35 of 66

Could someone tell me more specifically about SIMD width? (is there any similarity/relationship between it and x-wide vectors?) I searched on the internet and the only formal definition I found is "The number of working items processed by a GPU thread". So, seems like it is only related to GPUs, not CPUs? I'm really confused about this when doing the homework.


We say SIMD instructions are "N-wide" when they operate on length-N vectors. Both modern CPUs and GPUs employ the idea of SIMD processing to improve processing efficiency, so the term is not GPU-specific.


In simpler terms, what I understand of explicit and implicit SIMD is that in explicit SIMD the compiler does the mapping of data to SIMD units (this is like static branch prediction), while in implicit SIMD the hardware does the mapping of data to SIMD units (this is like dynamic branch prediction). But in both cases we write the program the same way using intrinsics (that is, explicit vs. implicit has no effect on the user). Am I right here? Also, how is it decided whether implicit or explicit SIMD should be used?


@sam: The first part of your statement was exactly correct. However, I'd like to respond to the second part.

In explicit SIMD programming the program binary has SIMD instructions in it. Just like any other instruction, if the processor sees an SSE/AVX instruction in an instruction stream, it decodes it and tells the SIMD unit (aka the SIMD ALUs) to do its thing. This is how SIMD processing is controlled on x86.

In implicit SIMD, the ISA does not have vector instructions in it. That is, there's no such thing as a vector add instruction. Thus, the program binary is a scalar program describing scalar operations on scalar registers (add two scalar registers). However, unlike x86 where the OS tells the processor "run this program", in an implicit SIMD architecture the processor is told "run N copies of this program", where N is typically a large number. Then the architecture can decide how many instances of the program to run simultaneously, based on its SIMD execution width.

In other words, the interface to the hardware is data-parallel. Did that help?

It makes no sense to write a program with vector intrinsics on an implicit SIMD machine, since there are no SIMD instructions to translate those intrinsics into. However, a program written in a high-level data-parallel language (see lecture 3) can easily be compiled to either an explicit SIMD or implicit SIMD ISA.


@kayvonf Yes it helped. Thanks for explaining.


Since explicit SIMD ISAs contain vector instructions, as programmers we can use the vector instructions to create SIMD processing with a single instruction stream and multiple data. However, for implicit SIMD, I'm not too sure what you mean by "run N copies of this program". Does that mean implicit SIMD only works when there are N copies of the program with the same data? Or can it work with different data as well? Because programs aren't often run multiple times on the same data.