Slide 53 of 86
kayvonf

The important thing to note on this slide (and the last one) is that the total amount of time it takes to complete the work of one thread is longer than if hardware multi-threading were not used. This is because the thread has to share the core's execution resources with other threads.

In this example, even though thread 0 could have started running at the point when memory returns the necessary data, it cannot run because another thread is using the core's execution resources.

If the goal is to complete any one thread in the minimum amount of time (reduce the latency to complete a single operation), then hardware multi-threading is not a good idea.

If the goal is to maximize the efficiency of the system (maximize utilization of the processor, which is equivalent to maximizing completed instruction throughput), then multi-threading helps, since it provides a way to avoid stalls due to waiting on memory. We care about throughput when we have many threads to run and want to complete all of them as quickly as possible, but don't care how long it takes to complete any one thread.
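The latency-vs-throughput trade-off can be made concrete with a toy simulation (my own sketch, not from the lecture; all parameters are assumed). Each thread executes a few bursts of math instructions, and every burst except the last ends in a load that stalls for a fixed memory latency. A single-issue core hands execution to the next runnable thread whenever the current one stalls:

```python
def run(num_threads, compute=4, latency=8, phases=3):
    """Simulate switch-on-stall multi-threading on a single-issue core.

    Each thread executes `phases` bursts of `compute` instructions; every
    burst except the last ends in a load that stalls for `latency` cycles.
    Returns (cycles until the first thread finishes, cycles until all do).
    All parameter values are made up for illustration.
    """
    compute_left = [compute] * num_threads   # instructions left in burst
    phases_left = [phases] * num_threads     # bursts left per thread
    ready_at = [0] * num_threads             # cycle the thread can run again
    first_done = None
    done = 0
    cur = 0                                  # round-robin search start
    cycle = 0
    while done < num_threads:
        # Run the current thread until it stalls; on a stall (or finish),
        # hand the core to the next runnable thread in round-robin order.
        pick = None
        for i in range(num_threads):
            t = (cur + i) % num_threads
            if phases_left[t] > 0 and ready_at[t] <= cycle:
                pick = t
                break
        if pick is not None:
            compute_left[pick] -= 1          # core issues one instruction
            if compute_left[pick] == 0:      # burst over
                phases_left[pick] -= 1
                if phases_left[pick] == 0:   # thread finished
                    done += 1
                    if first_done is None:
                        first_done = cycle + 1
                else:                        # issue the load, then stall
                    compute_left[pick] = compute
                    ready_at[pick] = cycle + 1 + latency
                cur = (pick + 1) % num_threads
            else:
                cur = pick
        cycle += 1
    return first_done, cycle

print(run(1))   # one thread alone
print(run(4))   # four threads sharing the core
```

With these assumed numbers, `run(1)` returns `(28, 28)` and `run(4)` returns `(36, 48)`: multi-threading makes thread 0 take 36 cycles instead of 28 (worse latency), but all four threads finish in 48 cycles instead of the 4 × 28 = 112 a serial schedule would need (better throughput).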

LuCheng

If the memory latency is 0, does multi-threading provide any speedup?

kayvonf

@LuCheng. This is a very good question. What do you think the answer is? If there are no stalls due to long-latency operations (such as a data access), what would the efficiency of the processor be?

Kaharjan

@LuCheng I think that if memory latency is 0, all 4 threads in the above CPU could run simultaneously (not just concurrently). I don't know whether I am right or not?

kayvonf

@Kaharjan. This is close, but ...

Remember, this processor has only one fetch/decode unit and one set of 8 SIMD ALUs. Therefore it can run only one SIMD instruction per clock. This is true regardless of how many of its four threads are actually runnable. Only one can run per clock, so their execution is interleaved in time on the core.

Summary: This illustration has one core that can run one SIMD operation per clock. Therefore it is not possible to run two threads simultaneously. Threads run interleaved on this core.

Note that multi-threading is only a technique for hiding stalls. It is a technique for using the execution resources a processor already has more efficiently, not for increasing the instruction throughput of the core when it is fully utilized.
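A back-of-the-envelope calculation makes this point, and answers the zero-latency question above (the numbers here are assumptions for illustration, not taken from the slide). If a thread computes for C cycles and then stalls for L cycles, one thread keeps the ALUs busy only C out of every C + L cycles, while N threads can fill the stall time:

```python
# Each thread computes for C cycles, then stalls L cycles on memory.
# (C and L are made-up values for illustration.)
C, L = 4, 12

# Single thread: the core is busy C out of every C + L cycles.
util_1 = C / (C + L)

# N threads: stalls are fully hidden once (N - 1) * C >= L, and
# utilization can never exceed 100%.
N = 4
util_n = min(1.0, N * C / (C + L))

print(f"1 thread : {util_1:.0%} ALU utilization")   # 25%
print(f"{N} threads: {util_n:.0%} ALU utilization")  # 100%

# If L = 0 there are no stalls to hide: util_1 = C / (C + 0) = 100%
# already, so multi-threading provides no speedup at all.
```

This is why multi-threading helps only when latency is nonzero: it raises utilization toward the core's fixed one-instruction-per-clock ceiling, and once the core is already fully utilized there is nothing left to gain.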

Kaharjan

@kayvonf So in the above picture, the blue box is the time for loading elements 0 to 7, which are then processed using SIMD. If there were no blue box (0 latency), there would be no speedup from multi-threading? Am I correct?

If my assumption about SIMD is correct, I think in the next slides there should be no overlap between the yellow boxes, just like the blue boxes. Am I correct?

kayvonf

The figure illustrates which thread (1, 2, 3, or 4) is executing math instructions on the core at a particular moment in time.

  • The Y-axis is time.
  • The blue regions indicate which thread is running on the core (instructions from this thread are being executed by the core's ALUs).
  • The yellow regions indicate periods of time a thread cannot run because it is waiting for memory to return data requested by a load.

You will notice that at any one instant in time (any row in the figure) there is at most one blue column. This is because the core has only one execution unit, so only one thread can be running at once.

You will notice that there may be multiple yellow columns at a time, because we are assuming that multiple threads can be waiting on memory requests at the same time. (Multiple requests can be sent to memory.)
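That pattern can be reproduced with a toy script (my own sketch; the burst length, latency, and counts are assumed, not the slide's actual numbers). It prints one row per thread with time running left to right (the slide puts time on the Y-axis instead): 'R' marks cycles where the thread runs on the core (the blue regions) and 'w' marks cycles it waits on memory (the yellow regions). In every time step exactly one thread is running, while several may be waiting:

```python
# Toy reconstruction of the figure's pattern (assumed parameters).
# 'R' = running on the core (blue), 'w' = waiting on memory (yellow),
# '.' = finished / no work in flight.
C, L, N, P = 4, 12, 4, 3    # compute burst, memory latency, threads, bursts

total = N * P * C            # with (N - 1) * C >= L the core never idles
for t in range(N):
    row = ['.'] * total
    for p in range(P):
        start = (p * N + t) * C          # round-robin burst placement
        row[start:start + C] = ['R'] * C
        if p < P - 1:                    # every burst but the last stalls
            row[start + C:start + C + L] = ['w'] * L
    print(f"thread {t}: {''.join(row)}")
```

Reading down any column of the output, there is exactly one 'R' (one thread owns the execution unit each cycle) but often several 'w's (multiple outstanding memory requests), matching the at-most-one-blue, possibly-many-yellow structure of the figure.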