What's Going On

Are these three types of memory just the same memory hardware with different namespaces? Or do they have any hardware difference?


I don't understand the second point. How do rules of operation cause unnecessary communication? Where does the 2x overhead come from?


haoran commented on slide_008 of Parallel Programming Basics ()

As shown in the figure, L is the edge length of the region (cell) whose mass has been grouped together, D is the distance to it, and theta is just a threshold constant.
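
To make the test concrete, it usually looks something like this (a minimal sketch of my own, not the slide's code; the function and parameter names are made up):

    // Hypothetical sketch of the opening criterion discussed above.
    // L     = edge length of the cell whose mass has been grouped together
    // D     = distance from the body to that cell
    // theta = threshold constant controlling the accuracy/speed trade-off
    bool can_approximate(float L, float D, float theta) {
        // If the cell looks small enough from where the body sits,
        // treat all of its mass as a single point mass.
        return (L / D) < theta;
    }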


I think "hard scaling" is also known as "strong scaling".


star013 commented on slide_070 of GPU Architecture & CUDA Programming ()

My understanding of 'warp'

In the SPMD model, the program logic is the same for every thread; only the data each thread works on differs, so the instruction stream for each thread appears to be the same. Previously, I was confused: if the instructions for each thread are the same, why does one warp context contain 32 separate parts for the different threads (see the figure on the previous slide)? One possible reason, I guess, is that different threads may take different control-flow paths. For example, if there is a "for" loop with an "if" statement inside it, the number of iterations executed (I mean of the instructions inside the loop) may depend on the data. The loop in some threads may stop early while the loop in other threads is still running. At that point the former threads become idle, and the way to achieve this may be to mask off their corresponding instructions. Therefore it is necessary to hold a separate part for each thread within one warp. I do not know whether this explanation is correct, so feel free to verify my thoughts!
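
To make the data-dependent loop scenario concrete, here is a toy CUDA kernel (my own sketch, not code from the slides) where threads in the same warp can iterate different numbers of times, so the hardware has to mask off the threads that finish early:

    // Toy kernel: the loop trip count depends on per-thread data, so threads
    // in the same warp can diverge. Threads whose loop exits early sit masked
    // (idle) while the rest of the warp keeps executing the loop body.
    __global__ void divergent_loop(const int* iters, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = 1.0f;
        for (int k = 0; k < iters[i]; k++) {   // per-thread trip count
            x = x * 1.0001f + 0.5f;            // some per-iteration work
        }
        out[i] = x;
    }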


star013 commented on slide_046 of GPU Architecture & CUDA Programming ()

Explaining the first red rectangle on this slide: my understanding is that when the GPU fills per-block shared memory, it interprets the assignment instructions as a whole rather than running the load instructions separately in each thread. It makes sense because the shared-memory assignment does not have to be tied to a single thread, and this method reduces the total number of instructions and can make use of spatial locality to speed up loading.
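
For reference, a cooperative load into shared memory typically looks something like this (my own sketch, not the slide's code; the 256-element tile and the kernel name are made up, and it assumes blocks of at most 256 threads):

    // Each thread in the block copies one element into per-block shared
    // memory; consecutive threads touch consecutive addresses, so the loads
    // coalesce and benefit from spatial locality.
    __global__ void cooperative_load(const float* input, float* output, int n) {
        __shared__ float tile[256];                // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            tile[threadIdx.x] = input[i];          // one load per thread
        __syncthreads();                           // whole block sees the tile

        if (i < n)
            output[i] = tile[threadIdx.x] * 2.0f;  // use the staged data
    }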


amolakn commented on slide_037 of Parallel Programming Abstractions ()

@haoran From what I know, not necessarily. They certainly can be; there are protocols like XPC (cross-process communication) that exist for processes to communicate.

However, as you'll see in a later assignment, sometimes it's better to think of it as machines communicating over the network. When you send or receive a message, it may go to another machine over the network for distributed processing across multiple machines (technically they're processes on different machines, but the point is just to be aware that it's not always thought of as inter-process communication).


amolakn commented on slide_002 of A Modern Multi-Core Processor ()

@hrw Yes.

@BerserkCat Also yes. This is when chips use more transistors to optimize single-stream execution.

Whenever I think of this, branch prediction comes to mind as one of those optimizations, and it is a whole research area in itself. Every time I think of branch prediction I think of this clip from a Steve Jobs keynote where he mentions branch prediction and has absolutely no idea what it is.

https://youtu.be/jAsxzwHaGjk?t=1m54s


amolakn commented on slide_048 of Parallel Programming Basics ()

@hrw If the code weren't being run in a loop, you're right in that only the second would be necessary. But consider the fact that it is being run in a loop.

The second barrier is pretty clear: let's have each thread finish before we do our check, in order to ensure that the check accounts for all threads. But consider the next iteration if the first and third barriers weren't there. For this example, I'm going to say we have two threads, thread A and thread B.

Let's start with an example of what could happen if the first barrier weren't there:

    Thread A          | Thread B
    ------------------+------------------
    Enter loop        |
    diff = 0.0f       |
    Compute           |
    diff += myDiff    |
    Hit barrier 2     |
                      | Enter loop
                      | diff = 0.0f
                      | Compute
                      | diff += myDiff
                      | Hit barrier 2

Do you see what went wrong here? Thread A computed myDiff and added it to diff, but thread B then reset diff to 0 again, undoing whatever thread A had just done in its execution.

I want to leave this open for discussion: can anyone else explain why the third barrier is necessary?
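
For reference, here is roughly the loop structure we're discussing, with the three barriers marked (my reconstruction from this thread, not the slide's exact code; barrier(), lock()/unlock(), compute_my_rows(), diff_lock, and TOLERANCE are hypothetical helpers and names):

    // Per-thread solver loop (sketch). diff is shared by all threads;
    // myDiff is private to each thread.
    while (!done) {
        float myDiff = 0.0f;
        diff = 0.0f;                 // every thread resets the shared sum
        barrier(nthreads);           // barrier 1: no thread may reset diff
                                     //   after another has already added to it
        myDiff = compute_my_rows();  // per-thread work (hypothetical helper)
        lock(diff_lock);
        diff += myDiff;              // accumulate into the shared sum
        unlock(diff_lock);
        barrier(nthreads);           // barrier 2: all contributions are in
                                     //   before anyone reads diff
        if (diff / (n * n) < TOLERANCE)
            done = true;
        barrier(nthreads);           // barrier 3: see the open question above
    }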


hrw commented on slide_048 of Parallel Programming Basics ()

Why are the first and third barrier necessary?


hrw commented on slide_008 of Parallel Programming Basics ()

What do L, D & theta stand for?


haoran commented on slide_037 of Parallel Programming Abstractions ()

Are threads basically processes in message passing model?


@SuperMario, you will have your chance to experience some of the models and their relative strengths as part of this course. :)


SuperMario commented on slide_053 of Parallel Programming Abstractions ()

While the shared address space model is the one typically taught for parallel programming, I am interested to see the drawbacks and benefits of programming in the other two models as well.


SuperMario commented on slide_063 of A Modern Multi-Core Processor ()

GPUs are built this way because they are typically used for graphics processing (which is inherently parallel) and for ML matrix operations (which are also parallel).


SuperMario commented on slide_033 of Why Parallelism? Why Efficiency? ()

This is a scheduling problem that can oftentimes greatly reduce power efficiency, as well as performance.


haoran commented on slide_040 of A Modern Multi-Core Processor ()

The major difference between superscalar and multi-core/SIMD is that the former does not require the user or the compiler to be aware of it.


haoran commented on slide_039 of Why Parallelism? Why Efficiency? ()

ILP means Instruction-level parallelism.


haoran commented on slide_017 of Why Parallelism? Why Efficiency? ()

There are at least two reasons we don't want large chips: 1. We don't want chips taking up too much space in our devices. For example, most of the space in my iPad is taken by the battery; a larger chip would mean a smaller battery given a fixed amount of space. 2. A larger chip will have a lower yield rate, so the cost will go up.


I don't really understand the drawbacks of stream programming in this slide. Is it saying that it would be hard for the library to express some complicated data flows? Or is it challenging for the compiler to carry out a highly optimized execution plan?


hrw commented on slide_015 of A Modern Multi-Core Processor ()

I think it is saying that the many-simpler-cores solution is better than the single-fancy-core solution given the same number of transistors.


hrw commented on slide_002 of A Modern Multi-Core Processor ()

Regarding self check 1: I think it means the processor is doing (fetching) one instruction at a time.


BerserkCat commented on slide_003 of A Modern Multi-Core Processor ()

So I think it's pretty easy to see that the two challenges of accessing memory are memory latency and memory bandwidth, but I think that we covered a lot of ideas concerning parallel execution in class, and I'm not sure which two are the most important concepts mentioned here. Can anyone tell me what they are?


BerserkCat commented on slide_002 of A Modern Multi-Core Processor ()

Self check 2: According to Wikipedia: "a single-core superscalar processor is classified as an SISD processor (Single Instruction stream, Single Data stream)", so it is about optimizing the performance of executing a single-instruction stream.


magister commented on slide_019 of Why Parallelism? Why Efficiency? ()

I think as the transistor sizes became smaller, the resistance and capacitance went down, enabling higher clock rates. Interestingly, this also enabled more space on the die to fit more logic. Going parallel was the logical choice for the additional free space on the die after shrinking the transistors.


ametoki commented on slide_056 of A Modern Multi-Core Processor ()

Isn't SMT somewhat similar to ILP, in that they both try to make use of multiple ALUs at the same time? The difference, as I understand it, is that ILP tries to parallelize a single thread just-in-time, while SMT runs operations from another thread when there are idle ALUs in the core.


ametoki commented on slide_027 of A Modern Multi-Core Processor ()

As mentioned in class, AVX does "vectorized" arithmetic calculations on special 256-bit registers (for instance), effectively handling multiple normal calculations (e.g. 8 32-bit floats, or 4 64-bit doubles) with just one instruction.

My question is, how is an AVX instruction carried out in hardware? In the example of a 256-bit add of 8 floats, do all 8 ALUs (of our imaginary CPU) carry out the calculations simultaneously?
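
For what it's worth, the "one instruction for 8 floats" part looks like this in code (my own host-side example using compiler intrinsics, not from the slide; how many physical ALUs execute it, and whether they run in lockstep, is up to the particular CPU implementation):

    #include <immintrin.h>

    // Adds 8 packed 32-bit floats with a single AVX instruction (vaddps).
    void add8(const float* a, const float* b, float* out) {
        __m256 va = _mm256_loadu_ps(a);     // load 8 floats (unaligned ok)
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vc = _mm256_add_ps(va, vb);  // one vector add covers all 8 lanes
        _mm256_storeu_ps(out, vc);          // store 8 results
    }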


scottm commented on slide_064 of A Modern Multi-Core Processor ()

$12 \times 480 \times 1.2 \times 10^9 / 1024^4 \approx 6.29 \text{ TB/s}$


Question: What's the difference between parallel computing and distributed computing?

The answer I found: "In broad terms, the goal of parallel processing is to employ all processors to perform one large task. In contrast, each processor in a distributed system generally has its own semi-independent agenda, but for various reasons, including sharing of resources, availability, and fault tolerance, processors need to coordinate their actions." from [1],[2].

[1] Distributed Computing. Fundamentals, Simulations, and Advanced Topics. Hagit Attiya and Jennifer Welch. 2004.

[2] https://cs.stackexchange.com/questions/1580/distributed-vs-parallel-computing


I think the alternative way to increase the number of transistors, other than increasing the density, is to increase the area of the chip. In this way, the power density doesn't increase while total computational power goes up. Why does this solution not work?


park914 commented on slide_015 of A Modern Multi-Core Processor ()

Why not use two fancy cores? Is it just because it is too expensive?


zihengl commented on slide_033 of A Modern Multi-Core Processor ()

Explanation of current slide

(In the context of a masked vector program)

When there's a conditional branch (e.g. an if-else statement in C), the way that a masked vector program will handle it is:

  1. The mask is set according to the specified condition, applied to the values in the vector currently being handled. For example, assume a vector of width 4 containing the values {2, -4, 3, 5}. The mask, under the condition (x > 0), will be set to {1, 0, 1, 1}, indicating the result of the comparison.

  2. Code in the corresponding block is executed based on the mask values. For example, in the case given in step 1, ALUs 1, 3, and 4 will execute the code in the if{} block while ALU 2 is idle, so in this case the utilization is 3/4. In the worst case the utilization will go down to 1/VECTOR_WIDTH. (After ALUs 1, 3, and 4 have finished executing the if{} block, ALU 2 will execute the else{} block, and during that step the utilization will be 1/4. A scalar emulation of this masking behavior is sketched below.)
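
Here is a scalar emulation of that masking behavior, just to make the idea concrete (my own sketch; a real SIMD unit would do each per-lane loop in one instruction rather than looping):

    // Scalar emulation of masked vector execution (illustrative only).
    const int VECTOR_WIDTH = 4;

    void masked_if_else(const float* x, float* out) {
        bool mask[VECTOR_WIDTH];

        // Step 1: set the mask from the condition (x > 0).
        for (int lane = 0; lane < VECTOR_WIDTH; lane++)
            mask[lane] = (x[lane] > 0.0f);

        // Step 2a: the if{} block runs only in lanes whose mask bit is set.
        for (int lane = 0; lane < VECTOR_WIDTH; lane++)
            if (mask[lane])  out[lane] = x[lane] * 2.0f;

        // Step 2b: the else{} block runs in the remaining lanes while the
        // others sit idle, so utilization over the two steps is below 100%.
        for (int lane = 0; lane < VECTOR_WIDTH; lane++)
            if (!mask[lane]) out[lane] = -x[lane];
    }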


krishang commented on slide_017 of Why Parallelism? Why Efficiency? ()

I think your argument makes sense, i.e. the area did increase. I just noticed that, while 8086 chips were packaged as Dual Inline Packages (DIPs), the 386 chips were packaged as 132-pin PGAs. That seems like a reasonable increase in size, but maybe we're missing a bigger factor here?


I'm not sure, but I know that it's not the same as the number of transistors: power density is the amount of power consumed per unit area, so if the area increased then the power density would decrease. If the number of transistors increased, but they became more efficient, or were spread out over a larger area, then that may explain the power density changes. Not sure though... Tried Googling without luck :(


scottm commented on slide_030 of Why Parallelism? Why Efficiency? ()

The speedup is not necessarily gained from the use of more than one processor.

For instance, in the case of ISPC, $\text{speedup (with ISPC)} = \frac{\text{execution time (without ISPC)}}{\text{execution time (with ISPC)}}$.


krishang commented on slide_017 of Why Parallelism? Why Efficiency? ()

What is the reason for the power density going down between '80 and '90, even though the number of transistors kept increasing?