Question: It would be awesome if someone took a stab at providing definitions for these terms; these will come up time after time in this class (when we analyze program performance and hardware design), so it's best to make the investment to properly understand what they mean.
I could work on this. Would a text dump here be ok? Or I could give a Dropbox link to a text file or something.
@HLAHat: Definitely just give it a shot right here!
Multi-core processor: a processor which has two or more independent processing units.
SIMD execution: Single Instruction, Multiple Data; the same instruction is executed on multiple pieces of data simultaneously.
Coherent control flow: the same instruction sequence can be applied to all of the elements being operated on simultaneously, which makes efficient use of SIMD processing resources.
So here's a pretty basic overview of all the terms. Please, please, please give feedback! If one of my definitions is unclear (or just wrong) or you feel like there's an important aspect I failed to cover, let me know and I'll edit it in. Of course, if you find any typos or anything small, I'd like to hear that too! Thanks all.
Multi-core processor: A computer component with multiple processing units present to execute instructions.
Basically just a CPU containing multiple units (cores) capable of processing instructions.
SIMD Execution: Stands for Single Instruction, Multiple Data.
This is where a CPU core contains multiple ALUs which all run the same instruction per clock, but on separate data.
That is, the CPU decodes one instruction and sends it to all ALUs. Each ALU can then grab its own unique piece of data and perform the given instruction on it, all simultaneously.
The comments on Slide 22 give some extra context as well.
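To make that concrete, here's a toy sketch in plain Python (the four-lane width and the "double" operation are purely illustrative; real SIMD happens in hardware lanes, not a loop):

```python
# Toy model of SIMD: the core decodes ONE instruction and hands it to
# every ALU lane, and each lane applies it to its own data element.
# The lane count (4) and the op (doubling) are illustrative only.
def simd_execute(instruction, lane_data):
    # Same instruction, different data per lane, in one "clock".
    return [instruction(x) for x in lane_data]

double = lambda x: x * 2
print(simd_execute(double, [1, 2, 3, 4]))  # [2, 4, 6, 8]
```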
Coherent Control Flow: Where every element being processed has the same instruction sequence performed on it.
See Slide 33.
A program must have Coherent Control Flow in order to use a SIMD architecture.
Note that programs lacking this property may still be parallelizable.
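A rough sketch of what incoherence costs on SIMD hardware (the masking scheme below is a simplification, and the function names are made up):

```python
# When lanes disagree on a branch, SIMD hardware effectively runs BOTH
# paths and masks off the inactive lanes, so even one divergent lane
# doubles the work. This masking model is simplified for illustration.
def simd_branch(data, cond, if_true, if_false):
    mask = [cond(x) for x in data]
    # Pass 1: "true" lanes execute, "false" lanes sit idle.
    after_true = [if_true(x) if m else x for x, m in zip(data, mask)]
    # Pass 2: "false" lanes execute, "true" lanes sit idle.
    return [y if m else if_false(y) for y, m in zip(after_true, mask)]

# Mixed branch outcomes: two passes needed for four lanes of work.
# With coherent control flow (all lanes agree), one pass would suffice.
print(simd_branch([1, -2, 3, -4], lambda x: x > 0,
                  lambda x: x * 10, lambda x: -x))  # [10, 2, 30, 4]
```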
Hardware Multi-Threading: The processor (as opposed to the Operating System) makes decisions about which threads to run (and then runs them).
The processor needs space to store the context for each thread (register states, etc.).
Interleaved Multi-Threading: At each clock, a core chooses one thread to run.
An example for this starts at Slide 44.
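A toy round-robin model of that, with the thread contents invented for illustration:

```python
# Interleaved multi-threading, toy version: each "clock" the core picks
# exactly one runnable thread and executes one of its instructions.
def interleave(threads):
    schedule = []
    while any(threads.values()):
        for name, instrs in threads.items():
            if instrs:           # this thread still has work to do
                instrs.pop(0)    # run one instruction this clock
                schedule.append(name)
    return schedule

# T0 has two instructions, T1 has one: the core alternates while it can.
print(interleave({"T0": ["add", "mul"], "T1": ["load"]}))  # ['T0', 'T1', 'T0']
```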
Simultaneous Multi-Threading: Each clock, a core will pick runnable instructions from multiple threads and run them on its ALUs.
A summary of both concepts can be found on Slide 53 and a list of general pros/cons can be found on the following slide.
Memory Latency: This is the time delay between a processor requesting data from memory and receiving it.
See Slide 40.
Memory Bandwidth: This is different from latency in that bandwidth measures the amount of data that can travel from memory to the processor over a span of time.
For example, let's say 1 MB of data takes 1 ms to transfer on Processor 1. It has a latency of 1 ms (the time taken for the data to travel) and a bandwidth of 1 GB/sec.
On Processor 2, 1 KB of memory gets transferred in 1 us. The latency is much shorter, but the bandwidth is exactly the same!
Slide 40 enumerates the differences between this and Latency.
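You can sanity-check those example numbers directly:

```python
# Verifying the example above: bandwidth = bytes moved / time taken.
MB, KB, GB = 10**6, 10**3, 10**9

bw1 = (1 * MB) / 1e-3   # Processor 1: 1 MB in 1 ms
bw2 = (1 * KB) / 1e-6   # Processor 2: 1 KB in 1 us

# Both come out to 1 GB/sec, even though the latencies differ 1000x.
print(bw1 / GB, bw2 / GB)  # 1.0 1.0
```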
Bandwidth Bound Application: "Bound" just means that the execution speed of a program is limited by some factor.
So, if an application is bandwidth bound, its execution speed is limited by the memory bandwidth.
That is, the CPU has no latency troubles when executing; it's just that memory cannot provide ALL of the required data to the CPU in time.
This means the CPU will likely have to wait for multiple segments of data to arrive from memory, even if each individual segment gets there quickly.
See slides 62 and 63.
Arithmetic Intensity: The ratio of math operations to data accesses. See Slide 63.
Higher is better, because the CPU can keep doing math on data it already has while it waits for the limited bandwidth to deliver more data.
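As a made-up example of measuring it for a kernel:

```python
# Hypothetical kernel: per element it does one load, three math ops,
# and one store. All of these counts are invented for illustration.
math_ops = 3
memory_ops = 2          # 1 load + 1 store
intensity = math_ops / memory_ops
print(intensity)        # 1.5 math ops per memory access
# A kernel with higher intensity gives the core more math to chew on
# between memory accesses, hiding more of the bandwidth shortfall.
```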
@HLAHat has put in a great effort here. Let's have some people jump in to help refine.
Multi-core processor: contains a bunch of different orange things (independent cores) so it can run completely separate processes in parallel
SIMD: single instruction, multiple data. This is the vector math multiplication we discussed earlier. Suppose you're trying to calculate the Taylor series expansion for a bunch of different n. The n=1 and n=2 cases can be calculated independently from one another, but both follow the same instruction stream. As a result, you can do them in parallel on multiple ALUs while giving those ALUs the same set of instructions.
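For instance, the terms of the e^x series all follow the same instruction stream (x**n / n!), so a SIMD machine could evaluate several n in lockstep; in plain Python that lockstep is just a comprehension:

```python
import math

# Each "lane" computes x**n / n! for its own n: same instructions,
# different data, so this maps naturally onto SIMD lanes.
def taylor_terms(x, ns):
    return [x ** n / math.factorial(n) for n in ns]

terms = taylor_terms(1.0, range(8))
print(sum(terms))  # ~2.71825, converging toward e
```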
Coherent control flow: if statements mess things up. When you have a program that takes multiple branches, you could get really bad efficiency if only one of the calculations that you could have done in parallel (as per SIMD) takes a different branch than the rest. This is what the red "x"s on slide_032 illustrate. If all of the data took the true branch, you'd have coherent control flow.
Hardware multi-threading: one processor with multiple ALUs and multiple sets of registers. Processor decides to send instructions to various ALUs/registers; when one thread stalls (e.g. waiting for a memory access to complete) we go on to the next one.
Interleaved multi-threading: one thread at a time. Core goes through and picks the next thread that can currently do something.
Simultaneous multi-threading: related to how CPUs used to execute multiple instructions at the same time (superscalar execution). Core chooses a combination of instructions from multiple threads that best use the resources that core has access to. For example, if there's one thread that needs all of the ALUs to perform some operations, the core could only work on that thread for now. However, if one thread wouldn't use all the resources then the core would pick multiple threads and execute all the instructions from them it could at the same time.
Memory latency: let's say I ping you. The time you take to get back to me is the latency.
Memory bandwidth: how much data can be transferred at a time
Bandwidth bound application: if you have sufficiently large data, clever ways of parallelizing operations won't help you speed things up. For example, in the thought experiment from slide_061 the GPU can calculate the multiplication 8x faster than a CPU simply because the GPU has 8x the memory bandwidth of the CPU. In that example the data is large enough that you have to keep fetching data from memory to continue the operation, so no matter how you do the actual calculation the amount of time it takes to fetch all the data you need from memory is the limiting factor on performance.
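The limiting factor is easy to compute; note the bandwidth and data-size numbers below are assumptions for illustration, not the slide's actual figures (only the 8x ratio comes from the example):

```python
# If a computation must stream `data_bytes` through memory, its runtime
# can never beat data_bytes / bandwidth, no matter how fast the ALUs are.
def min_runtime(data_bytes, bandwidth):
    return data_bytes / bandwidth

cpu_bw = 25e9          # 25 GB/s: an assumed CPU bandwidth
gpu_bw = 8 * cpu_bw    # the 8x ratio from the thought experiment
data = 100e9           # 100 GB to stream: also an assumption

# When bandwidth bound, the speedup equals the bandwidth ratio: 8x.
print(min_runtime(data, cpu_bw) / min_runtime(data, gpu_bw))  # 8.0
```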
Arithmetic intensity: it's quicker to recalculate something you need instead of storing it, if storing it would make you need to retrieve it from memory later. Specifically, arithmetic intensity is the ratio of math operations to data access operations. Say a math operation takes 5 cycles to complete but a retrieval from memory takes 30 cycles; then if you can recompute the value in fewer than 6 math operations, you should probably just recalculate it rather than fetch it.
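Using those (made-up) cycle counts, the break-even check looks like:

```python
# Recompute-vs-fetch break-even, with the example's invented cycle costs.
MATH_CYCLES = 5     # cost of one math operation
FETCH_CYCLES = 30   # cost of one memory retrieval

def should_recompute(ops_to_recompute):
    # Recomputing wins when its total cycles undercut one fetch.
    return ops_to_recompute * MATH_CYCLES < FETCH_CYCLES

print(should_recompute(5))  # True: 25 cycles beats a 30-cycle fetch
print(should_recompute(7))  # False: 35 cycles, cheaper to fetch
```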
@afzhang Good job - I like your explanations (you just saved a lot of us work later in the semester). Two minor critiques:
First, be careful with the multi-core processor definition: the orange boxes shown on most of the slides were just decoders/memory fetchers. Though it is a bit hairy to define a core (see the discussion on the first few slides), in this class we consider a core to at least include an instruction/memory fetcher, (an) ALU(s), and (an) execution context(s).
Second, though it probably won't make a difference in your life, be careful with separating superscalar execution from (the above described) simultaneous multi-threading on multiple ALUs. Superscalar execution can occur even if you have one thread and only one ALU. If my understanding is correct, in superscalar execution, one ALU is used to simultaneously process two different operations. For example, if I need to add two numbers and do a bit shift, I can input the addition information, and while the ALU is processing that (it does take ALUs time to complete such operations) we can send in the information to perform the bit shift, which may be done at the same time. The reason this works is that, if the operations are chosen properly, they are mostly executed on different parts of the ALU's hardware, so they don't conflict or overwrite one another.
I guess it is possible to get bogged down in arbitrary amounts of technical detail with some of this. I hope the above is correct. If not, someone feel free to shout me down.
@lament: Good comments. One thing I want to clarify is the "one ALU is used to simultaneously process two different operations" point. Given how I've chosen to describe things in this lecture, a hardware execution unit can only do one operation at a time. If a processor is to do two things at once, it requires multiple execution units. Superscalar execution involves using multiple execution units simultaneously to perform different instructions from a single thread. If I don't have multiple yellow boxes in my drawings, then the processor is not capable of performing multiple instructions at once.