spilledmilk

Intel has also unveiled 512-bit AVX instructions as of July 2013, but it looks like the first of their processors that will support this technology has not been released yet (the Intel Xeon Phi processor and coprocessor, codenamed "Knights Corner").

kayvonf

Yes, Xeon Phi is out.

http://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core

analysiser

@kayvonf: Question: I'm a bit confused about what the 128-bit and 256-bit operations stand for. For example, with 8-wide float vectors, does that mean AVX enables 8 ALUs (or cores) to do float calculations simultaneously?

kayvonf

It's just the size of the vectors being operated on. An SSE instruction operates on 128-bit registers. Those registers can hold either four 32-bit values or two 64-bit values. For example, an SSE instruction can add two length-4 vectors of single-precision floating-point values in one clock.

An AVX instruction operates on 256-bit registers. So, for example, an AVX instruction can add two length-8 vectors of single-precision floating-point values in one clock. Twice as much throughput!

Each of the _mm256 intrinsics on slide 24 corresponds to an AVX vector instruction executed by one core. In the terminology of this lecture, a vector execution unit made up of 8 parallel floating point ALUs (eight yellow boxes) in a single processor core executes this instruction. Alternatively, you might want to think about a vector execution unit as a single entity, with hardware in it to do eight operations at once.
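
To make that concrete, here is a minimal C sketch (my own example; the function name add8 is made up) of the 8-wide add described above, using real AVX intrinsics:

#include <immintrin.h>  // AVX intrinsics

// Add two length-8 float vectors. _mm256_loadu_ps / _mm256_storeu_ps
// move 8 floats between memory and a 256-bit ymm register;
// _mm256_add_ps performs all 8 additions with one vector instruction.
void add8(const float *a, const float *b, float *out) {
    __m256 va  = _mm256_loadu_ps(a);
    __m256 vb  = _mm256_loadu_ps(b);
    __m256 sum = _mm256_add_ps(va, vb);
    _mm256_storeu_ps(out, sum);
}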

tianyih

It seems to me that the SIMD optimization would have different effects on CPUs with different numbers of ALUs per core. So programs would have to be recompiled on every machine? In other words, do the executable files have really bad portability?

yixinluo

@tianyih If the width of the vector units (not the number of ALUs, to avoid confusion) on the CPU has changed, then you will need to recompile the code. More generally, if any instruction (a CPU instruction, not a C statement) used by your program has different functionality on the new CPU, you will have to recompile it. Fortunately, this almost never happens on desktop CPUs now that Intel dominates, because Intel keeps backward compatibility with all of their old instructions. As an example for this slide, a new Intel CPU that supports 256-bit AVX instructions still supports 128-bit SSE instructions.
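
A minimal sketch of the resulting portability trick (assuming GCC or Clang, which provide the __builtin_cpu_supports builtin): a single binary can probe the CPU at run time and choose a code path, rather than being recompiled for every machine.

#include <stdio.h>

// Runtime dispatch sketch: ship several code paths in one binary
// and pick the widest one the CPU actually supports.
int main(void) {
    if (__builtin_cpu_supports("avx"))
        printf("AVX available: use the 8-wide (256-bit) path\n");
    else if (__builtin_cpu_supports("sse"))
        printf("SSE available: use the 4-wide (128-bit) path\n");
    else
        printf("No vector support: use the scalar path\n");
    return 0;
}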

nbatliva

@kayvonf I'm a little confused about the interaction between ISPC program instances/gangs and ALUs. Consider the program

void func() {
    foreach (i = 0 ... 8) {  // iterates i = 0 .. 7
        // do computation
    }
}

Under SSE, say the values i = 0,1,2,3 are processed first and then i = 4,5,6,7. How are these values of i mapped to instances in a single gang? In my mind, it makes sense to say that on a single iteration of execution of the ALUs, a gang will process i = 0,1,2,3 simultaneously and then on the next iteration, the gang will process i = 4,5,6,7 simultaneously. But where does an individual program instance come into play here? Can a gang of program instances be viewed as a single thread, or should each program instance in the gang be viewed as its own thread? Thanks

kayvonf

@tianyih: Good point. There's a discussion of code portability going on at slide 24.

@nbatliva: Any ISPC function call always runs as an SPMD program. That is, a gang of programCount ISPC program instances runs the logic defined by the ISPC function.

The foreach construct basically says: for each iteration in the defined iteration space, have one of the gang's program instances carry out the logic of the foreach body. The foreach does not specify which program instance will execute which iteration, nor does it specify the order in which the iterations will be carried out. ISPC makes no other guarantees.

Now, your question is about how the foreach construct is actually implemented by the compiler. Well, we'd have to go look at the generated code to figure it out. However, I happen to know that the ISPC compiler will generate code that maps iterations to program instances in an interleaved fashion, as was done here. To check your understanding, check out the question I ask on Lecture 3, slide 14.
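
For intuition, here is a hypothetical C sketch (my own illustration, not actual ISPC compiler output) of how a 4-wide (SSE) gang might execute the foreach above with a body like out[i] = in[i] * in[i]:

#include <xmmintrin.h>  // SSE intrinsics

// Each _mm_* call below is one vector instruction whose four lanes
// are the gang's four program instances running in lockstep.
// Lane j handles i = base + j, so instance 0 covers i = 0 and i = 4,
// instance 1 covers i = 1 and i = 5, and so on (interleaved mapping).
void func_lowered(const float *in, float *out) {
    for (int base = 0; base < 8; base += 4) {         // i = 0..3, then i = 4..7
        __m128 v = _mm_loadu_ps(in + base);           // 4 instances load together
        _mm_storeu_ps(out + base, _mm_mul_ps(v, v));  // 4 instances multiply and store
    }
}

Note how on this view the whole gang corresponds to a single sequential instruction stream (one thread); each program instance is a SIMD lane within that stream, not a thread of its own.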

jhhardin

I looked into compilers that find parallelism in for loops, and found this interesting article about automatic parallelization in C/C++ compilers by Intel:

http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers

The "Advice" section is particularly interesting, as it touches on the limitations. Even though it's certainly a hard problem, it seems like there is quite a bit of room for improvement! (Maybe us programmers won't always need to think about parallelism and will eventually let the compiler do a good deal of it for us?)