What's Going On
haoran commented on slide_026 of Implementing Synchronization ()

Note that head indicates the ticket number, and a status of 0 means the lock is acquired.

haoran commented on slide_040 of Interconnection Networks ()

Wormhole flow control requires more sending and receiving than packet-level routing. Is its performance still on par with packet-level routing?

haoran commented on slide_037 of Scaling a Web Site ()

Adding a cache makes each web server stateful. How can the load balancer determine which web server holds the cached data a request needs? If dispatching is largely random, locality is not exploited.

Is "with higher core count, the computation is increasingly bandwidth bound" due to the fixed number of disks?

Typo: the incoming edges for vertex 6 are from {3, 5} instead of {3, 6}

I think one needs to hold the locks of the node's parent, the node itself, and all of its children during the traversal.

i.e. n/r * perf(r) differs for different r, even though the total number of transistors doesn't change.

Yes, then you are correct. Smaller cores give you more computational power, if you have the parallelism to use them.

Oh, sorry for the confusion. What I'm referring to is computational power, not physical power.

No. Modeling perf(r) as sqrt(r) means that as you double the transistors, performance increases by about 1.4x. Power still remains proportional to the number of transistors.

If perf(r) is modeled as sqrt(r), the total computational power varies with how the transistors are partitioned into cores, even when the total number of transistors is the same.
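To make the point above concrete, here is a minimal sketch of the arithmetic. It assumes the sqrt(r) model from this thread and a chip with a budget of n base-core-equivalents split into n/r identical cores; the function names are made up for illustration.

```python
import math


def perf(r):
    # Assumed model from this thread: a core built from r units of
    # transistors delivers sqrt(r) units of performance.
    return math.sqrt(r)


def total_throughput(n, r):
    # n/r identical cores, each delivering perf(r).
    return (n / r) * perf(r)


if __name__ == "__main__":
    # Same transistor budget (n = 16), three different partitions.
    for r in (1, 4, 16):
        print(f"r={r:2d}: cores={16 // r:2d}, throughput={total_throughput(16, r)}")
```

With n = 16 the aggregate throughput is 16 for r = 1, 8 for r = 4, and 4 for r = 16, so the smaller cores win as long as there is enough parallelism to keep all of them busy.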

Without looking back at the paper, I suspect that the curves decline due to cache effects. As in, the larger working set reduces the arithmetic intensity, which lowers the GFLOPs.

In one sense, FPGAs and ASICs have significantly more parallelism than GPUs. They are just gates. So if you need 1 million adders, then you design that component. Usually, you think about implementing a computation efficiently, and then tiling that component (such as the adder mentioned previously) until the space is used.

Also, in general, these components require operations similar to cudaMemcpy to explicitly move the data and then "launch" the computation.

It has been 15 years since I last tried to program an FPGA. It was difficult then. My rough recollection is that we first wrote the code in software, which took about 1 week, and then spent 6 weeks implementing it on the FPGA. Mainly, you have to express everything in much greater detail, and debugging / "compiling" takes significantly longer.

Why add FPGAs? Data centers have large-scale deployments, so small changes can make measurable differences (think about my example of screws). And energy usage matters. Search logic regularly changes, so ASICs are not viable. The workload does not correspond well to GPUs. This leaves FPGAs as a possible approach to improve efficiency versus CPUs, while still retaining some flexibility.

Why do we not have cores with FPGAs? The same as the other heterogeneous examples in class. Each component starts as a separate device, until it is integrated in after a certain level of market penetration.

Generally, the OS relies on the program to make any scheduling decisions, since programs are generally written with the assumption that all cores are equal in compute power. With big.LITTLE, the OS may be aware of the option to switch, but it is often coarse-grained (either all big or all LITTLE).

To extend this formula to support k "fat" cores, replace perf(r) + (n - r) with k*perf(r) + (n - k*r).

As for the performance, in (almost) all cases, having k > 1 is worse than k=1.
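A rough sketch of how one might check this numerically. It assumes the usual asymmetric form of Amdahl's law, where the serial fraction (1 - f) runs on one fat core and the parallel fraction f runs on everything at once, plus the sqrt(r) model for perf(r). The function name and the sample numbers are illustrative only.

```python
import math


def perf(r):
    # Assumed model: performance of a core built from r units is sqrt(r).
    return math.sqrt(r)


def speedup(f, n, r, k=1):
    """Speedup with k fat cores of size r and (n - k*r) unit-size cores.

    f is the parallel fraction and n the total transistor budget.
    The serial part runs on a single fat core; the parallel part uses
    k*perf(r) + (n - k*r), as in the extended formula above.
    """
    if k * r > n:
        raise ValueError("k fat cores of size r do not fit in budget n")
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f / (k * perf(r) + (n - k * r))
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, round(speedup(f=0.9, n=256, r=64, k=k), 2))
```

Under these assumptions the serial time does not change with k, while each extra fat core trades r small cores for only sqrt(r) of throughput, so the parallel part gets slower whenever r > 1.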

sqrt(r) - https://en.wikipedia.org/wiki/Pollack's_Rule

bpr commented on slide_014 of Performance Monitoring Tools ()

The counters are always available, except enabling them is (usually) a privileged operation and requires the kernel to program / read. Their cost primarily depends on the threshold value, after which point an interrupt is triggered. In general, the cost is <5% overhead.

haoran commented on slide_014 of Performance Monitoring Tools ()

Are these counters always available, or only enabled under a special mode? I believe leaving this recording on all the time could hurt performance.

Okay, so you can only enable four counters at a time.

scottm commented on slide_034 of Snooping-Based Cache Coherence ()

I believe state O must be elevated to state M before writing again.

haoran commented on slide_034 of Snooping-Based Cache Coherence ()

Can a cache holding a line in state O keep modifying its contents? From my point of view, every modification to a line in state O should be broadcast to the other caches.

Recently Intel's new Skylake-X platform changed cache hierarchy from a small L2 / big shared L3 to a big L2 / small shared L3. Considering this year Intel almost doubled the core count on its highest-end product, is this shift a technical compromise of scalability or a performance optimization?

haoran commented on slide_024 of Parallel Programming Case Studies ()

What is extra work here?

More specifically, a system with a write-allocate policy.

While not universally true, it is the commonly used heuristic.

"Steals work at beginning of call tree: this is a "larger" piece of work, so the cost of performing a steal is amortized over longer future computation."

=> I suppose this is talking about the quicksort example? This is of course not universally true.

The example rule is that the system must load an entire cache line before any stores to that line can occur.
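One way to see where a 2x figure could come from under that rule is to count bytes moved. This is only a back-of-the-envelope sketch: it assumes 64-byte cache lines, that the program overwrites whole lines, and that every dirty line is eventually written back. The function name is hypothetical.

```python
import math


def store_traffic_bytes(bytes_written, line_size=64, write_allocate=True):
    # Number of cache lines touched by the stores.
    lines = math.ceil(bytes_written / line_size)
    # Every dirty line is eventually written back to memory.
    writeback = lines * line_size
    # With write-allocate, each line is first loaded from memory,
    # even though the program is about to overwrite all of it.
    fill = lines * line_size if write_allocate else 0
    return fill + writeback


if __name__ == "__main__":
    print(store_traffic_bytes(1024))                        # load + write back
    print(store_traffic_bytes(1024, write_allocate=False))  # write back only
```

Writing 1024 bytes moves 2048 bytes with the load-before-store rule and 1024 bytes without it, which is the factor of two.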

These are different memories. The per-thread private memory is basically registers. The shared memory is similar to an L1 D$. And the global memory is typical DRAM.

Are these three types of memory just the same memory hardware with different namespaces? Or do they have any hardware difference?

I don't understand the second point. How do rules of operation cause unnecessary communication? Where does the 2x overhead come from?

haoran commented on slide_008 of Parallel Programming Basics ()

As in the figure, L is the edge length of the grouped mass and D is the distance to it. Theta is just a threshold constant.
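For reference, here is a tiny sketch of how these three quantities are typically combined. I am assuming the slide uses the standard Barnes-Hut test, L/D < theta, which I have not checked against the figure; the function name is made up.

```python
def can_approximate(L, D, theta):
    # Assumed standard Barnes-Hut criterion: if the group looks small
    # (edge length L) relative to how far away it is (distance D), treat
    # the whole group as a single point mass instead of descending
    # into its children.
    return (L / D) < theta


if __name__ == "__main__":
    print(can_approximate(L=1.0, D=10.0, theta=0.5))  # far away: approximate
    print(can_approximate(L=1.0, D=1.5, theta=0.5))   # close by: open the group
```

A smaller theta means fewer groups are approximated, which costs more work but gives more accurate forces.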

I think "hard scaling" is also known as "strong scaling".

star013 commented on slide_070 of GPU Architecture & CUDA Programming ()

My understanding of 'warp'

In the SPMD model, the program logic is the same for every thread that processes the data, so the instructions for each thread appear to be the same. I was previously confused: if the instructions are the same for every thread, why does one warp context contain 32 separate parts for different threads (the figure in the previous slide)? One possible reason, I guess, is that different threads may evaluate conditions differently. For example, if there is a "for" loop with an "if" statement inside it, the number of iterations executed (I mean the instructions inside the loop) may depend on the data. The loop in some threads may stop early while the loop in other threads is still running. At that point the former threads will be idle, and the way to achieve this may be to mask off the corresponding instructions. Therefore it is necessary to hold separate parts for different threads in one warp. I do not know whether this explanation is correct. Feel free to verify my thoughts!
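The masking idea described above can be sketched as a toy simulation. This is not how the hardware is implemented; it assumes a "warp" of 4 lanes instead of 32, one shared instruction stream, and a per-lane loop trip count that depends on the data. All names are hypothetical.

```python
def run_warp(trip_counts):
    """Simulate lanes executing one data-dependent loop in lockstep.

    trip_counts[i] is how many iterations lane i needs.
    Returns (masks, work): the active mask used at each step, and the
    number of iterations each lane actually executed.
    """
    remaining = list(trip_counts)   # per-lane state, like per-thread registers
    work = [0] * len(remaining)
    masks = []
    # The warp keeps issuing the loop body until every lane is finished.
    while any(r > 0 for r in remaining):
        mask = [r > 0 for r in remaining]   # lanes still inside the loop
        masks.append(mask)
        for lane, active in enumerate(mask):
            if active:                      # masked-off lanes sit idle
                work[lane] += 1
                remaining[lane] -= 1
    return masks, work


if __name__ == "__main__":
    masks, work = run_warp([1, 3, 2, 3])
    for step, mask in enumerate(masks):
        print(step, mask)
    print("per-lane work:", work)
```

The warp takes as many steps as its slowest lane, and each lane needs its own state (here `remaining` and `work`) even though the instruction stream is shared.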

star013 commented on slide_046 of GPU Architecture & CUDA Programming ()

Explaining the first red rectangle on this slide: my understanding is that when the GPU allocates per-block shared memory, it interprets the assignment instructions as a whole rather than running the load instructions in each thread. This makes sense because the shared memory allocation can be done without a thread, and this approach reduces the total number of instructions and can exploit spatial locality to speed up loading.

amolakn commented on slide_037 of Parallel Programming Abstractions ()

@haoran From what I know, not necessarily. They certainly can be; there are mechanisms like XPC (cross process communication) that exist for processes to communicate.

However, as you'll see in a later assignment, sometimes it's better to think of it as machines over the network. When you send or receive a message, it may be to another machine over the network for distributed processing across multiple machines (technically they're processes on different machines, but the point is just to be aware that it's not always thought of as inter process communication).

amolakn commented on slide_002 of A Modern Multi-Core Processor ()

@hrw Yes.

@BerserkCat Also yes. This is when chips use more transistors optimizing single stream execution.

Whenever I think of this, I know branch prediction is one of those optimizations which is a whole research area in itself. Every time I think of branch prediction I think of this clip from a Steve Jobs keynote where he mentions branch prediction and has absolutely no idea what it is.


amolakn commented on slide_048 of Parallel Programming Basics ()

@hrw If the code weren't being run in a loop, you're right in that only the second would be necessary. But consider the fact that it is being run in a loop.

The second barrier is pretty clear, let's have each thread finish before we do our check in order to ensure that our check is accounting for all threads. But consider the next iteration if the first and third barrier weren't there. For this example, I'm going to say we have two threads, thread A and thread B.

Let's start with an example of what could happen if the first barrier weren't there:

Thread A          | Thread B
------------------+------------------
Enter loop        |
diff = 0.0f       |
diff += myDiff    |
Hit barrier 2     |
                  | Enter loop
                  | diff = 0.0f
                  | diff += myDiff
                  | Hit barrier 2

Do you see what went wrong here? Thread A computed myDiff and added it to diff, but Thread B then reset diff to 0, undoing what Thread A had just done.
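If you want to experiment with this, here is a small sketch of the loop shape with all three barriers in place. It is not the code from the slide: it uses Python's `threading.Barrier`, a fixed per-thread `my_diff` as a stand-in for the real computation, and a made-up `run_solver` helper.

```python
import threading


def run_solver(num_threads=4, iterations=3):
    barrier = threading.Barrier(num_threads)
    lock = threading.Lock()
    state = {"diff": 0.0}               # shared accumulator
    # observed[t][i] = value of diff that thread t saw in iteration i
    observed = [[None] * iterations for _ in range(num_threads)]

    def worker(tid):
        for it in range(iterations):
            state["diff"] = 0.0         # every thread resets diff
            barrier.wait()              # barrier 1: all resets done before any add
            my_diff = float(tid + 1)    # stand-in for this thread's local work
            with lock:
                state["diff"] += my_diff
            barrier.wait()              # barrier 2: all adds done before any check
            observed[tid][it] = state["diff"]
            barrier.wait()              # barrier 3: separates this check from the next iteration

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed


if __name__ == "__main__":
    print(run_solver())
```

With four threads every thread should observe 1 + 2 + 3 + 4 = 10.0 in every iteration. Try removing one barrier at a time and see whether that still holds on every run.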

I want to leave this open for discussion: can anyone else explain why the third barrier is necessary?

hrw commented on slide_048 of Parallel Programming Basics ()

Why are the first and third barrier necessary?

hrw commented on slide_008 of Parallel Programming Basics ()

What do L, D & theta stand for?

haoran commented on slide_037 of Parallel Programming Abstractions ()

Are threads basically processes in message passing model?

@SuperMario, you will have your chance to experience some of the models and their relative strengths as part of this course. :)