What's New : Parallel Computer Architecture and Programming : Tsinghua Summer 2017

kayvonf commented on slide_019 of Memory Consistency ()

However, the big point of the lecture today was to get you to think about the effects of a relaxed consistency model. We used TSO and PC was two examples of relaxed consistency schemes (W->R was relaxed) in order to work through some exercises. However, I do not consider remembering the details of TSO and PC (beyond relaxing W->R ordering) the intellectually interesting part of the lecture. Even though we do not have an exam in this course, the details of specific relaxed consistency approaches would not be information that I would ever choose to put on an exam.

kayvonf commented on slide_019 of Memory Consistency ()

Yes, they are consistent. All the following statement: "A system exhibits Processor Consistency if the order in which other processors see the writes from any individual processor is the same as the order they were issued" says is that writes are not reordered (That W->W ordering is respected)

Processor consistency relaxes read after write (W->R) ordering. See the following from Wikipedia:

"Processor Consistency (PC) relaxes the ordering between older stores and younger loads that is enforced in Sequential consistency (SC). This allows loads to be issued to the cache and potentially complete before older stores, meaning that stores can be queued in a write buffer without the need for load speculation to be implemented (the loads can continue freely)"

Unlike TSO, PC does not provide a guarantee that all processors receive notification that a write has committed at the same time. If other words, if processor P2 receiving notification of a write from P1, there is no gaurantee that P3 is aware of this write yet.

swjz commented on slide_019 of Memory Consistency ()

I still don't get what PC exactly means. On Wikipedia, it says 'A system exhibits Processor Consistency if the order in which other processors see the writes from any individual processor is the same as the order they were issued'. Is it the same thing as the definition given in this slide? If the answer is yes, why the PC results of example 4 in the next slide don't match that of sequential consistency?

Kaharjan commented on slide_068 of Parallel Programming Basics ()

should be the line diff[index]=0.0f instead of diff[0]=0.0f? And I think when diff is diff[2] instead of diff[3],it still works...?

kayvonf commented on slide_041 of A Modern Multi-Core Processor ()

@lapack: comments are below.

"all of them are used to exploit ILP"

You wrote ILP here, but you really meant parallelism in general. ILP specifically is used to mean independent instructions in a single instruction stream. Hardware multi-threading is not exploiting independent instructions in an instruction stream... instead it is exploiting the property that the program features multiple threads of execution, or in order words, multiple instruction streams. It does turn out that when processing a single instruction stream, pipelining (although we have not talked about it in class yet), and superscalar execution depend on there being ILP in the instruction stream.

Your comments on pipelining the execution of instructions are correct. However, we have not discussed this in class so far. (We've only assumed that instructions have some latency.)

__To implement superscalar execution_ a processor must execute multiple instructions per clock (it needs more execution units) as well as the ability to fetch/decode more than one instruction per clock to feed these execution units. So I agree with you when you wrote "superscalar duplicates execution units and pipelines".

SMT: When you wrote SMT (simultaneous multi-threading) you actually mean interleaved [hardware] multi-threading. Please see my detailed description here and these review slides here. However, if you had written interleaved multi-threading out would have agreed completely with what you wrote.

"it can exploit ILP across multiple threads rather than only one thread"

Using the term ILP is not correct here. Multi-threading takes advantage of the fact that threads, by definition, are different instruction streams. A chip that runs more than one thread concurrently is exploiting thread-level (coarse) parallelism in the application, not instruction level parallelism without one threads.

Finally, you are correct that the Pentium 4 only supported one execution context and therefore did not support hardware multi-threading.

lapack commented on slide_041 of A Modern Multi-Core Processor ()

I want to make sure I understand several basic concepts correctly including pipeline, superscalar, SMT. Firstly, all of them are used to exploit ILP. Pipeline can improve CPU throughput not latency by using different hardware units simultaneously, such as IF, ID, Ex. Superscalar duplicates execution units and pipelines, for example two pipelines and two integer units, so it can execute two add instructions simultaneously, but CPU with one pipeline and execution unit can only do it sequentially. SMT adds more thread context but not execution units, it can exploit ILP across multiple threads rather than only one thread. This can't be done by superscalar. Let's take Pentium 4 as example, it has two Integer units and FPU, so I guess it has two pipelines and it's superscalar CPU. It seems not a SMT cpu. Am I right? Thanks!

kayvonf commented on slide_077 of A Modern Multi-Core Processor ()

@xielei. This is a good question. The answer is that your understanding of the pictures is correct, and that my pictures are not detailed enough to distinguish between the cases you describe. So you'd need more information about the dispatch rules supported by the core to really understand how it behaves. It is true that a picture with two execution contexts per core (blue boxes) and two decode units (orange boxes) might illustrate:

Run one thread per clock (interleaved multi-threading) with support for superscalar execution within the thread.
Run two threads per clock via simultaneous multi-threading, (SMT) but not support SIMD within the thread (max one instruction per thread)
Be able to run any mixture of instructions from either one or two threads (This is Intel Hyperthreading -- it can run two instructions from one thread (superscalar with interleaved multi-threading) or one instruction from each of two threads (simultaneous multi-threading and no superscalar)

But it would be very complicated to try and indicate all this information in a figure.

kayvonf commented on slide_004 of Parallel Programming Basics ()

@xielei. SIMD is also parallelism! It's just parallelism that shares an instruction stream. (Lecture 2 discussed both SIMD parallelism, multi-core parallelism, and superscalar parallelism.)

What I meant in the question above is a little subtle, so I'll explain here:

The code above defines the logic carried out sequentially by a single program instance. Therefore, to be precise, not on the loops you see are carried out in parallel. The first loop (i loop) iterates over a subset of the elements in the input array. The inner loop (j loop) iterates over terms in the Taylor expansion. A single program instance executes all this code serially.

However, as we know, ISPC never runs only a single program instance, it runs a gang of program instances simultaneously, by mapping the logic of all program instances in a gang to a stream of SIMD instructions. So the parallelism comes from launching a gang of program instances, which occurs in the call to the ispc function (sinx in this case). There is no parallelism expressed in the ISPC function's definition itself.

kayvonf commented on slide_007 of The SPMD Programming Model ()

@xielei. This is a good example of the difference between abstraction and implementation. You could say that ISPC presents the abstraction that each program instance following its own logical sequence of control. For example, an ISPC program and can while loops, if's, etc.

However, you are absolutely correct that the implementation of a gang of program instances, from the processor's perspective, is a single SIMD instruction stream. In other words, all instances in the gang share a single hardware execution context, and a single program counter. As a result of this implementation decision, whenever the code paths of different ISPC program instances diverge (e.g. at an if statement), the masking techniques described here must be used to preserve the abstraction. The result is a loss of performance since not all SIMD execution units (vector lanes) can do useful work in some cycles.

kayvonf commented on slide_057 of A Modern Multi-Core Processor ()

@xielei: The same number as a core will all the same features, but without multi-threading.

Compared to a basic processor without any of these features...

Multi-core execution adds execution units (via more cores... so it also adds execution contexts and instruction strream decode)
SIMD execution adds execution units (but no fetch/decode or execution contexts)
hardware multi-threading DOES NOT add additional execution capability, ONLY more execution contexts.

xielei commented on slide_004 of Parallel Programming Basics ()

Multiple simultaneous instruction streams is called parallelism, ISPC just emits SIMD instructions for this code, which are only one instruction stream operating on multiple data.

xielei commented on slide_057 of A Modern Multi-Core Processor ()

I don't understand "Core still has the same number of ALU resources", the same number as what? The number of contexts? But there are many examples where the numbers of execution contexts and ALU differ.

xielei commented on slide_007 of The SPMD Programming Model ()

I have a question from your comment above. Is each program instance a logical instruction sequence (a logical thread of control)? I think all instances have only one logical instruction sequence and they are only muliple data streams controlled by one instruction stream, as what SIMD means.

Thread actually has its own instruction stream independent from other threads.

xielei commented on slide_077 of A Modern Multi-Core Processor ()

How can we distinguish between superscalar and simultaneous multi-threading according to how the boxes are drawn? More than one context box means that multi-threading is supported, but many fetch/decode boxes have two meanings in this class, one is superscalar as is illustrated in this slide, the other is lecture 2, slide 60, where many fetch/decode boxes means simultaneous multi-threading. (However, the bright one with the dark one together means superscalar).

The question above is less important, but :

Superscalar needs at least two fetch/decode units, so does (simultaneous) multi-threading. (and multi-threading needs at least two contexts, but superscalar does not.) So if I have a core with many contexts and many fetch/decode, I can't know whether it is superscalar or simultaneous multi-threading?

kayvonf commented on slide_053 of A Modern Multi-Core Processor ()

The figure illustrates what thread (1, 2, 3, or 4) is executed math instructions on the core at a particular moment in time.

The Y-axis is time.
The blue regions indicate which thread is running on the core (instructions from this thread are being executed by the core's ALUs)
The yellow regions indicate periods of time a thread cannot run because it is waiting for memory to return data requested by a load.

You will notice that at any one instant in time (any row in the figure) there is at most one blue column. This is because the core only has one execution unit so only one thread can be running at once.

You will notice that there may be multiple yellow columns at a time, because we are assuming that multiple threads can be waiting on memory requests at the same time. (Multiple requests can be sent to memory.)

Kaharjan commented on slide_053 of A Modern Multi-Core Processor ()

@kayvonf So in above picture, blue box is time for loading elements 0 to 7, then run it using SIMD. if there is no blue box (0 latency), there would be no speedup compared to having multi-thread or not? am I correct?

If my assumption about SIMD correct, I think in the next slides, there should be no overlaps between yellow boxes just like blue box. am I correct?

kayvonf commented on slide_053 of A Modern Multi-Core Processor ()

@Kaharjan. This is close, but ...

Remember, this processor has only one fetch decode unit and one set of 8 SIMD ALUs. Therefore it can only run one SIMD instruction per clock. This is true regardless of how many of its four threads are actually runnable. Only one can run per clock, so their execution is interleaved in time on the core.

Summary: This illustration has one core that can run one SIMD operation per clock. Therefore it is not possible to run two threads simultaneously. Threads run interleaved on this core.

Note that multi-threading is only a technique that hides stalls. It is a technique for using the execution resources a processor has more efficiently, not increasing instruction throughput of the core when it is full utilized.

kayvonf commented on slide_050 of A Modern Multi-Core Processor ()

Let's ignore memory latency (say its 0) and assume all ops take 1 cycle. Then the code below:

LOAD LOAD MULT STORE

Would require four clocks. In practice, there are more complex issues in play in a real processor (pipeline, etc.) but 4 cycles is the correct answer given what we've talked about in class so far.

Kaharjan commented on slide_062 of A Modern Multi-Core Processor ()

because of many multi-threading

Kaharjan commented on slide_053 of A Modern Multi-Core Processor ()

@LuCheng I think, if memory latency is 0, all 4 threads in above CPU could run simultaneously (not concurrently). I don't know I am right or not?

Kaharjan commented on slide_050 of A Modern Multi-Core Processor ()

@kayvonf In above vector intrinsics, how many clocks to load, calculate and store? (assuming loading, calculating and storing take one clock time)

__mm256 a_vec = _mm256_mul_ps( _mm256_load_ps(&b, _mm256_load_ps(&c); 
_mm256_store_ps(a, a_vec);

In my view loading value of a and b takes two clocks, calculating takes one clock and storing is also takes one clock, am I right?

kayvonf commented on slide_053 of A Modern Multi-Core Processor ()

@LuCheng. This is a very good question. What do you think is the answer? If there are no stalls due to long latency operations (such as a data access) would would the efficiency of the processor be?

kayvonf commented on slide_050 of A Modern Multi-Core Processor ()

@LuCheng: There are two points I want to make:

Point 1:

The answer to your question is it will not. You have written C code that most modern compiles like g++ or clang would compile to regular "scalar" instructions. On a modern CPU, the only way to execute in a SIMD manner is for the instruction stream to have explicit SIMD instructions in it. For example, one of the following would need to happen.

Hope that G++ automatically vectorized the for look, removing the loop and turning it into a single 4-wide vector instruction. (This is not a transformation that a C compiler can do other than in the most simple cases.
Rewrite your program in an SPMD style language with a compiler that generates SIMD output (like ISPC)
Rewrite your program in vector intrinsics, like this:

__mm256 a_vec = _mm256_mul_ps( _mm256_load_ps(&b, _mm256_load_ps(&c); _mm256_store_ps(a, a_vec);

Point 2:

I am worried that since you asked this question on this slide, you might be confusing the concepts of SIMD execution, and hardware-multithreading. This slide is illustrating how if a core is able to maintain state for 4 different instruction streams (in other words, maintain 4 different execution contexts), it is possible to avoid stalls by immediately switching to another instruction stream when the current instruction stream would otherwise stall.

The choice of illustrating four execution contexts here is unrelated to 4-wide SIMD execution. Adding 4-wide SIMD processing capability modifies the execution capabilities of the core to perform four times more math operations in a single clock. When we are talking about multi-threading, we are not changing the number of operations that the core can perform at once. Instead, we are adding the ability to store multiple threads worth of state in the core in order to give the core the ability to use its existing execution resources more efficiently by hiding what would otherwise be processor stalls.

LuCheng commented on slide_053 of A Modern Multi-Core Processor ()

If the memory latency is 0, does multi-threading make any speedup?

Kaharjan commented on slide_050 of A Modern Multi-Core Processor ()

@LuCheng In my understanding SIMD could run using 4 ALUs. Specifically, in 2 clocks (assume loading takes one clock time) b,c values (all for value as vector) would be loaded to registers then 4 ALUs calculate mul . then write 4 values to c. So there would be only one execution context.

LuCheng commented on slide_050 of A Modern Multi-Core Processor ()

Can the following code run in parallel on this core?(same "mul" instruction but different address)

int a[4];
int b[4] = {0,1,2,3};
int c[4] = {4,3,2,1};
for(int i = 0; i < 4; i++) {
    a[i] = b[i] * c[i];
}

kayvonf commented on slide_062 of A Modern Multi-Core Processor ()

The answer to this question can really help understand the different between trying to build a computer than minimizes latency vs. one that maximizes throughput.

In many CPU applications, the goal of the application is to provide a fast response to the user on a single task. Think of applications like many of the apps on your phone, Microsoft Word, spreadsheets, etc. Therefore, a major goal of CPU design is to run a single instruction stream quickly. Ideas like [superscalar execution(http://15418.courses.cs.cmu.edu/tsinghua2017/lecture/whyparallelism/slide_047) are one way to improve the performance of a single instruction stream. Another way is to reduce the latency of memory access by adding caches. For many applications, bigger caches allows more memory accesses to be handled by the cache (larger caches reduce the number of cache misses). So on CPUs you see designs with only a few cores, and big caches. These designs are good when:

Programs have a lot of data reuse (access the same data multiple times)
There is not "other parallel work to do", so the only way to reduce stalls due to memory access is to __reduce memory latency with caches___, not hie memory latency by doing something else while waiting on memory to return the requested data.

In contrast, even though GPUs run many types of programs, GPUs are still designed for running real time graphics program. Those programs must proceed millions of pixels and vertices, so there are always many tasks to work on. As a result, rather than attempt to reduce latency by adding large caches to the GPU, GPU designers take advantage of the fact that most GPU workloads have large amounts of parallelism and use hardware multi-threading to hide latency. Since latency can be hidden, then big caches are not necessary to keep memory access times low.

So in summary: GPUs assume that the workload is very parallel and there are many subproblems to do (there's always something else to do in the event of a stall). Because of this assumption architects have prioritized latency latency via multi-threading over latency reduction via caching. CPUs assume the workload is not, and since a CPU cannot assume there are many other things to do in case of a stall, CPU designs attempt to eliminate or reduce the length of the stat via caching.

Question: Given the explanation above, give one reason why the GPU features higher memory bandwidth than the CPU.

ann commented on slide_062 of A Modern Multi-Core Processor ()

because the graphic processing instruction streams contain more bandwidth-consuming memory-related instructions such as load/store while CPUs have more arithmetic operations? and large quantities of different pixels are not suited for cache utilization?

Sirius commented on slide_060 of A Modern Multi-Core Processor ()

Thanks. It is clear to me now.

kayvonf commented on slide_062 of Parallel Programming Basics ()

Question: This is an important slide, since the need for mutual exclusion comes up many times in coordinating multiple threads.

(1) Can you describe the problem that occurs in the slide above? Two threads are trying to add 1 to the value of a shared variable diff.

(2) What is the property of mutual exclusion and how does it solve the problem?

(3) How can a programmer create a situation of mutual exclusion using locks?

kayvonf commented on slide_068 of Parallel Programming Basics ()

Question: Can someone explain how this solution works?

kayvonf commented on slide_038 of A Modern Multi-Core Processor ()

@Kaharjan. Yes, the SIMD width refers to the number of elements than can be processed at the same time. You can think of this as having multiple ALUs that all are controlled by the same instruction. It is also correct to think of SIMD processing as having a single ALUs that is able to execution on vectors of numbers instead of single numbers. Both descriptions would be equivalent for the purposes of this class.

kayvonf commented on slide_041 of A Modern Multi-Core Processor ()

@Kaharjan. Your statement is correct!

kayvonf commented on slide_074 of A Modern Multi-Core Processor ()

@Kaharjan. Yes, to make full use of this CPU, the program on the previous slide would need to be modified to create four threads. If the program as written was unchanged, it would only define two threads of execution. Those two threads would run on 2 of the 4 cores above, and 50% of the four-core CPU would not be used by the program.

kayvonf commented on slide_012 of A Modern Multi-Core Processor ()

Yes. State for one instruction stream = an execution context = one blue box.

This picture has one blue box, because the core is only able to manage a single thread.

Superscalar execution involves multiple fetch/decode units and multiple execution units processing multiple instructions from that same single instruction stream.

kayvonf commented on slide_059 of A Modern Multi-Core Processor ()

There are 16 cores x 4 independent instruction streams ("hardware threads") per core = 64.

Each thread executes 8-wide SIMD instructions that manipulate 8 elements of data.

So I need 64 x 8 = 512 pieces of data to process if I want to use all 64 threads, and have all the threads make use of 8-wide SIMD instructions.

kayvonf commented on slide_060 of A Modern Multi-Core Processor ()

Your thinking is on the right track, but here is a clarification.

Each "core" of the GPU is able to maintain state of 64 execution contexts (64 "hardware threads" = 64 "independent instruction streams").

Each of those threads executes 32-SIMD instructions, so that's 32 unique data elements processed by each hardware thread. This means that 64 x 32 = 2048 data elements are assigned to threads that are running concurrently on the core. I say concurrently because not all 64 of these threads can be executed exactly at the same time. Only 4 of these threads can be run in any one clock, so only 4 x 32 = 128 data elements can be processed simultaneously. (Notice there are only 4 x 32 = 128 ALUs in this picture.)

This picture not does illustrate superscalar execution. The core is choosing 4 different threads to run simultaneously and running one instruction from each of those threads. It is not superscalar because the core is not finding independent instructions within any one single thread.

Precisely, this core is performing interleaved multi-threading because it is able to maintain state for 64 threads, and only execute a subset of those threads (4) each clock. (It interleaves execution of the 64 threads onto the core's ALUs).

It is also performing simultaneous multi-threading because it is in fact running multiple threads simultaneously (4 of them).

If you look at the extra slides at the end of the lecture, there are further examples of the difference between interleaved and simultaneous multi-threading.

To be 100% precise, there is superscalar execution in this picture as well, but we did not talk about it in class. If you look at the picture there is a second fetch/decode unit shown that is grayed out. The core can actually select 4 threads to run, and then us 2-way superscalar execution to run two instructions from that thread if one instruction is a mathematical instruction running on the yellow ALUs and the other instruction is a LD/ST instruction that uses different units that I did not draw in the picture.

Kaharjan commented on slide_074 of A Modern Multi-Core Processor ()

I think previous program is also applicable, only if we create 4 threads

Kaharjan commented on slide_062 of A Modern Multi-Core Processor ()

GPU originally intend to process graphics need to access data quickly, so need to smaller caches to hide memory latency.

Kaharjan commented on slide_059 of A Modern Multi-Core Processor ()

There is 64 total concurrent instruction streams, I understand that. But I am not able to understand 512 independent pieces of work? How do we get that?

Kaharjan commented on slide_041 of A Modern Multi-Core Processor ()

In my understanding, Superscalar have more than one fetch/decode unite, so it could find parallelism automatically in instruction level. SIMD have one fetch/decode unite but several ALUs, so it can run the same instructions on different data, but controlled by compiler or programmer. Am I correct?

Kaharjan commented on slide_012 of A Modern Multi-Core Processor ()

In this architecture, why execution context drawn as one block, Dose that mean two fetch/decode unite share the same memory, and the same registers?

Kaharjan commented on slide_038 of A Modern Multi-Core Processor ()

Does SIMD width refers to number of ALUs? for example SIMD width 8 refers to there are 8 ALUs in once core?

Sirius commented on slide_060 of A Modern Multi-Core Processor ()

Hello, I am confused about how the concurrency should be calculated for a core with superscalar, SIMD and hardware multi-threading enabled. In this slide, I think the concurrency is 4(superscalar) * 32(simd) * 64(multi-threading) = 8192 though up to 4 * 32 = 128 elements can be processed simultaneously. Why is the answer 2048?

It seems that the concurrency has no relation with superscalar or the degree of SMT(4 here)?