Slide 37 of 65
sbly

I'm a bit confused on the difference between SIMD and ILP. Specifically, how can we have ILP without multiple ALUs to execute the parallel instructions?

lixf

I believe the difference is that in SIMD, you are running the same instruction on multiple pieces of data (thus the name S(ingle) I(nstruction) M(ultiple) D(ata)). But in ILP (instruction-level parallelism), you are executing several different instructions at the same time. The CPU uses fairly complicated logic to work out the dependencies between instructions, and it fetches and decodes far enough ahead in the instruction stream to find independent ones.

snshah

I'm still a bit confused; once ILP determines what can be parallelized, how does it run them at the same time? Context-switching?

yixinluo

@snshah In the superscalar case, the processor exploits ILP by deciding which instructions within a single instruction stream are independent of each other (and so can be executed in parallel). The processor then feeds the independent instructions to different ALUs simultaneously, and the ALUs work in parallel. No context switching is required because these instructions are all from the same thread.
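
To make "independent instructions within a single instruction stream" a bit more concrete, here is a minimal C sketch. The function names `sum_chain` and `sum_two_chains` are made up for illustration, and whether the hardware actually overlaps the adds depends on the core and on what the compiler emits, so treat this as a picture of the dependency structure rather than a performance claim.

```c
#include <stddef.h>

/* One dependency chain: each add into sum needs the result of the
   previous add, so those adds have to happen one after another. */
long sum_chain(const int *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Two independent chains: within one iteration, the add into s0 and
   the add into s1 do not depend on each other, so a superscalar core
   may send them to different ALUs in the same clock. */
long sum_two_chains(const int *a, size_t n) {
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        s0 += a[i];
    return s0 + s1;
}
```

Both functions are ordinary single-threaded code and return the same sum. The only difference is how much independent work the instruction stream exposes to the processor.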

sbly

OK, so to exploit ILP, you do need multiple ALUs.

black

I am not sure about the usage of multi-core in this sin(x) example (slide 16). When I used pthreads before, I gave each thread a different start index, but I don't quite understand what's going on in this example.

yixinluo

@black The multithreaded program computes sin(x) for each element of a vector x using a Taylor series expansion. The main thread launches one worker thread that computes the first args.N numbers in x. The main thread then acts as a worker itself and computes the rest (the "// do work" line between the two pthread calls).
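
In case it helps, the structure described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the slide's exact code: only `args.N` and the "// do work" comment come from the discussion, while the names `my_args`, `worker_start`, `sinx`, and `parallel_sinx`, and the details of the Taylor loop, are illustrative.

```c
#include <pthread.h>
#include <stddef.h>

/* Arguments handed to the worker thread (field names are assumptions). */
typedef struct {
    int N;          /* how many elements the worker handles */
    int terms;      /* number of Taylor terms after the leading x */
    const float *x;
    float *result;
} my_args;

/* Serial kernel: sin(x) ~= x - x^3/3! + x^5/5! - ... */
void sinx(int N, int terms, const float *x, float *result) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        float denom = 6.0f;                 /* 3! */
        float sign = -1.0f;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (float)((2 * j + 2) * (2 * j + 3));
            sign = -sign;
        }
        result[i] = value;
    }
}

/* Entry point of the worker thread: unpack the arguments and run the kernel. */
static void *worker_start(void *arg) {
    my_args *a = (my_args *)arg;
    sinx(a->N, a->terms, a->x, a->result);
    return NULL;
}

void parallel_sinx(int N, int terms, const float *x, float *result) {
    pthread_t thread_id;
    my_args args;
    args.N = N / 2;                 /* worker gets the first half */
    args.terms = terms;
    args.x = x;
    args.result = result;

    pthread_create(&thread_id, NULL, worker_start, &args);   /* launch worker */
    sinx(N - args.N, terms, x + args.N, result + args.N);    /* do work */
    pthread_join(thread_id, NULL);                           /* wait for worker */
}
```

The "different start index" is still there: the worker starts at index 0 and the main thread starts at index `args.N`, expressed here by offsetting the pointers instead of passing an index.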

kayvonf

A processor that supports superscalar execution is a processor that can fetch and decode multiple instructions per clock and then execute those instructions on different execution units (ALUs).

In the visual vocabulary of this lecture, a superscalar core would have multiple orange boxes.

Given a single thread of control (where the program specifies no explicit parallelism), in order to use multiple execution units the processor must inspect the instruction stream to find instructions that don't depend on each other, and then it can execute those instructions in parallel. That is exploiting instruction-level parallelism in the instruction stream.

In contrast, consider an instruction stream with a SIMD vector instruction in it. The processor can fetch and decode this one instruction (one orange box), and then execute it on a SIMD execution unit (shown in my diagrams as a block of multiple ALUs). This unit performs the desired operation on N different pieces of data at once.
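
As a rough illustration of the SIMD case, here is a small C sketch using the GCC/Clang `vector_size` extension. That extension is an assumption about the toolchain (it is not standard C), the names `float4`, `scale_add_scalar`, and `scale_add_simd` are made up for this example, and whether the compiler really emits a single vector instruction depends on the target.

```c
/* Compiler extension (GCC/Clang, not standard C): four floats packed
   into one 16-byte value that is operated on as a unit. */
typedef float float4 __attribute__((vector_size(16)));

/* Scalar version: the instruction stream contains a separate multiply
   and add for each of the four elements. */
void scale_add_scalar(const float *a, const float *b, float k, float *out) {
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * k + b[i];
}

/* Vector version: one multiply and one add in the source, each applied
   to all four lanes at once. */
float4 scale_add_simd(float4 a, float4 b, float k) {
    float4 kk = {k, k, k, k};   /* broadcast k into every lane */
    return a * kk + b;
}
```

In the scalar version, any parallelism has to be discovered by the processor (ILP) or by an auto-vectorizing compiler. In the vector version, the parallelism is stated explicitly in the operation itself.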

kayvonf

For those interested:

realworldtech.com tends to have technical, but quite accessible, articles that describe the architecture of modern processors. This article on Intel's Sandy Bridge architecture is a good description of a modern CPU. It's probably best to start at the beginning, but Page 6 talks about the execution units.