Slide 51 of 79
perfecthash

I found it interesting how multi-threading actually increases the run time of a single thread but uses the ALU more efficiently: during a stall in one thread, the core runs a different thread, which hides the memory latency of each thread. A stalled thread resumes later, while other threads take their turns stalling. Thus, the ALU is always being put to good use!
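A back-of-the-envelope way to see this (my own sketch, with made-up cycle counts, not anything from the lecture): suppose each thread runs a few ALU cycles and then stalls on memory. With enough threads, the core can fill every stall with another thread's work.

```python
# Toy model: each thread alternates `work` ALU cycles with a `latency`-cycle
# memory stall. The core round-robins among threads, so in one work+latency
# window it can overlap up to num_threads * work cycles of ALU work.
def alu_utilization(num_threads, work, latency):
    return min(1.0, num_threads * work / (work + latency))

# One thread alone leaves the ALU idle 75% of the time...
print(alu_utilization(1, work=2, latency=6))   # 0.25
# ...but four such threads are enough to hide a 6-cycle latency completely.
print(alu_utilization(4, work=2, latency=6))   # 1.0
```

Note that each individual thread still finishes later than it would running alone; only the aggregate ALU utilization improves.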

tcm

Something similar happens in basic CPU pipelining: the overall time to execute a single instruction increases with pipelining, but that doesn't matter, because we are executing on the order of billions of them per second. Instead, what matters is throughput of the instruction stream, and pipelining improves this.
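To put rough numbers on that (my own illustrative figures, not from the lecture): pipelining adds latch overhead per stage, so a single instruction's latency goes up, while throughput goes up much more.

```python
# Unpipelined: instructions execute one at a time, start to finish.
def unpipelined_ns(instr_ns, n_instructions):
    return instr_ns * n_instructions

# Pipelined: the first instruction takes the whole pipeline to drain;
# after that, one instruction completes every stage time.
def pipelined_ns(stage_ns, n_stages, n_instructions):
    return stage_ns * (n_stages + n_instructions - 1)

# A 5 ns instruction split into 5 stages of 1.2 ns (0.2 ns latch overhead):
# single-instruction latency rises from 5 ns to 6 ns, but a billion
# instructions finish roughly 4x sooner.
print(unpipelined_ns(5.0, 1_000_000_000))      # 5e9 ns
print(pipelined_ns(1.2, 5, 1_000_000_000))     # ~1.2e9 ns
```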

aravisha

Couldn't this approach potentially slow down the processor significantly if the memory bandwidth is not enough to allow multiple threads to load/store at the same time? At that point, wouldn't it make more sense to revert to the old convention of just letting one thread execute and complete its job before proceeding?

tcm

If the program is already bandwidth-limited, then neither prefetching nor multi-threading can make it run faster (since both of those techniques rely upon having additional memory bandwidth to exploit). Once the program is running at a bandwidth-limited rate, while this style of multi-threading won't help, I would not expect it to slow things down further. Was there a particular type of overhead that you were concerned about? Keep in mind that the processor has lots of time on its hands (so to speak) while it is running in a bandwidth-limited mode.
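One way to see why bandwidth puts a hard floor on runtime (a roofline-style sketch with invented numbers, not course material): runtime can never drop below the time it takes just to move the program's bytes through memory, no matter how much compute capacity the threads add.

```python
# Runtime is bounded below by both the compute time and the data-movement
# time; whichever is larger is the limiter.
def min_runtime_sec(flops, bytes_moved, peak_flops_per_sec, peak_bw_bytes_per_sec):
    return max(flops / peak_flops_per_sec,
               bytes_moved / peak_bw_bytes_per_sec)

# Moving 100 GB through a 25 GB/s memory system takes at least 4 s...
print(min_runtime_sec(1e9, 100e9, 1e12, 25e9))   # 4.0
# ...and doubling compute throughput (e.g., via more threads) changes nothing.
print(min_runtime_sec(1e9, 100e9, 2e12, 25e9))   # 4.0
```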

aravisha

I guess I was thinking of the following scenario: Say, for our program, the memory bandwidth was only large enough to allow 1 thread to do memory reads/writes at a time. When the first thread was busy trying to access memory (stalling), would the subsequent threads be woken up and started? Or would the processor realize that it was running at capacity and simply wait for the first thread to finish?

tcm

(I'm going to let some of your classmates chime in with their thoughts on this before I respond.)

amolakn

I had another question, just fundamental since this confused me a bit earlier.

The difference between SIMD and multithreading: having multiple ALUs does NOT let you run different instruction streams at the same time, but rather lets you run the same instruction (like the body of a loop) on multiple data elements at once when needed.

So the difference with multithreading is that it doesn't allow two threads with different instructions to run at the same time (since only one thread can use the processor's execution resources at any instant), but rather that the threads have independent contexts, so a stall in one thread won't interfere with the others? Are there examples besides stalls where there's also a benefit to multithreading?
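A rough way to picture the SIMD half of that distinction (my own sketch; the `width=4` vector size is an arbitrary choice): a SIMD unit applies ONE instruction to a whole vector of data per step, whereas multithreading interleaves independent instruction streams over time.

```python
# Conceptual SIMD addition: each iteration of the outer loop stands in for
# ONE vector instruction operating on `width` elements at once. There is a
# single instruction stream; only the data is parallel.
def simd_add(a, b, width=4):
    out = []
    for i in range(0, len(a), width):          # one "vector instruction" per step
        out.extend(x + y for x, y in zip(a[i:i+width], b[i:i+width]))
    return out

print(simd_add([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
# [11, 12, 13, 14, 15, 16, 17, 18]
```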

wxt

@aravisha: if memory bandwidth is already saturated by a single thread, no other threads can be doing memory reads/writes, regardless of stalling. Memory bandwidth has to do with how much data is being moved to and from memory, not necessarily how much work is being done by a thread. You can have 10 threads each doing the same operation while taking up 1/10 of memory BW, but an 11th won't make progress during the stalling time of any of those threads because BW is saturated.
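The arithmetic behind that example can be sketched like this (my framing of wxt's numbers): adding threads helps only until the sum of their bandwidth demands reaches the peak.

```python
# Bandwidth caps how many threads' worth of memory traffic can actually
# flow; beyond that point, extra threads contribute no extra throughput.
def effective_threads(per_thread_bw_fraction, num_threads):
    return min(num_threads, 1.0 / per_thread_bw_fraction)

# 10 threads each using 1/10 of peak bandwidth exactly saturate it...
print(effective_threads(0.1, 10))   # 10
# ...so an 11th thread adds nothing.
print(effective_threads(0.1, 11))   # 10.0
```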