What's Going On

To restate what Kayvon said in point one...

The move to finer-grained locks is motivated by a desire to increase potential parallelism. Yet as we increase the number of locks, we generally need to touch multiple locks in the course of a single operation, and each additional lock adds a fixed amount of overhead. So as we make our locking granularity finer, enabling smaller units of work to happen simultaneously, we inherently increase the fixed cost of those smaller units of work until the overhead becomes a dominating factor.

Perhaps this suggests a hybrid strategy where we have both fine-grained locks and a larger coarse lock. When we know we are going to perform an operation that spans the entire data structure, for example a rehash of a hash table, we can acquire the global lock and perform our actions on the data structure without having to incur the overhead of acquiring every fine-grained lock.
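
Here is one possible sketch of that idea, assuming pthreads and an illustrative hash table (the names insert/rehash are made up): per-bucket operations take a global reader-writer lock in shared mode plus one bucket lock, while whole-table operations take the global lock in exclusive mode and never touch the per-bucket locks.

#include <pthread.h>

#define NUM_BUCKETS 256

struct node { int key; struct node* next; };

struct table {
  pthread_rwlock_t global;                   // coarse lock over the whole table
  pthread_mutex_t bucket_lock[NUM_BUCKETS];  // fine-grained per-bucket locks
  struct node* bucket[NUM_BUCKETS];
};

void insert(struct table* t, struct node* n) {
  // Shared mode: other per-bucket operations may proceed concurrently.
  pthread_rwlock_rdlock(&t->global);
  unsigned b = (unsigned)n->key % NUM_BUCKETS;
  pthread_mutex_lock(&t->bucket_lock[b]);
  n->next = t->bucket[b];
  t->bucket[b] = n;
  pthread_mutex_unlock(&t->bucket_lock[b]);
  pthread_rwlock_unlock(&t->global);
}

void rehash(struct table* t) {
  // Exclusive mode: no per-bucket operation is in flight, so we can touch
  // every bucket without acquiring any of the fine-grained locks.
  pthread_rwlock_wrlock(&t->global);
  /* ... redistribute nodes across buckets ... */
  pthread_rwlock_unlock(&t->global);
}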


Instead of having two separate counters for both arriving and leaving, we can combine them into a single counter:

struct Barrier_t {
  LOCK lock;
  int counter;  // number of threads currently inside the barrier
  int flag;     // 0 while threads are arriving, 1 once all p have arrived
};

void Barrier(Barrier_t* b, int p) {
  // Wait until any previous use of the barrier has fully drained.
  while (b->counter > 0 && b->flag == 1);
  lock(b->lock);
  if (b->counter == 0) {
    b->flag = 0;  // first arrival resets the release flag
  }
  int num_arrived = ++(b->counter);
  unlock(b->lock);

  if (num_arrived == p) {
    b->flag = 1;  // last arrival releases everyone
  } else {
    while (b->flag == 0);  // wait for the release
  }

  // Leaving: the decrement must also be protected by the lock.
  lock(b->lock);
  --(b->counter);
  unlock(b->lock);
}

Please go over it yourself to verify that it works, as I am not 100% sure.


MPI is a concrete example of software choosing to always busy-wait rather than block. One might think this would be inefficient, but it's the default because we generally own the machine.

MPI is generally used in high-performance supercomputing environments where the running program is the only intensive program on the machine. In this situation, it makes sense to allocate a number of processes exactly matching the number of execution units. Choosing to block in this scenario wouldn't make sense: there is no other process waiting to run, and we would simply incur the overhead of context switching the waiting process out and back in.
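
As a concrete picture of what this looks like at the application level, here is a minimal sketch that polls an outstanding receive with MPI_Test instead of yielding the core (the function name and tag are illustrative; blocking calls like MPI_Recv typically spin internally by default as well):

#include <mpi.h>

void wait_for_value(int src, int* buf) {
  MPI_Request req;
  MPI_Irecv(buf, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &req);

  int done = 0;
  while (!done) {
    // Spin rather than block: on a dedicated node there is nothing else
    // useful for this core to run anyway.
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
  }
}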


What is the power and hardware overhead of correctly supporting parallelism and coherence? The whole point of creating systems that process independent instruction streams is to avoid the power wall associated with improving single-instruction-stream performance. It seems that we are able to use transistors more effectively at lower power by exploiting parallelism but, as we've seen in the complexity required to support this parallelism, it's not completely free. Is it possible that we have simply pushed off the issue of power by adopting a solution with a lower growth rate? That is, is there a point at which the mechanisms that support parallelism start stressing our power limits? If not, what is the fundamental reason, and what does it suggest about future systems?


apoms commented on slide_011 of Scaling a Web Site ()

Most high-performance web servers these days use an event-driven approach to handling requests. In a traditional web server, every I/O operation (a blocking system call for a disk read, a network request, etc.) blocks the calling thread until it completes. The workaround for this issue is to oversubscribe the system by launching a thread or process per request, allowing the OS to hide the latency by descheduling the blocked tasks. This approach works well for a small number of concurrent requests because we only need to allocate a few threads or processes at any one time.

However, today's web sites must serve hundreds of thousands of concurrent requests to meet the demands of their users. In a traditional server as described above, this is simply impossible for two main reasons. First, in order to serve 100,000 concurrent requests, the web server must allocate 100,000 thread or process contexts. With a thread stack size of 2 MB, the web server would need 200 GB of memory JUST to hold the initial thread state -- not even counting all of the other memory required during the execution of the threads. This issue is even worse for process contexts. Second, every time a thread finishes or performs some I/O, it must be ripped off the processor using expensive system calls and context switches. Each of these context switches can take hundreds of cycles, leading to untenable overhead, and the OS must now manage scheduling for hundreds of thousands of unique threads.

In response to these two issues, web servers now take advantage of special non-blocking operations for system calls, disk I/O, network requests, etc., to avoid spawning a unique thread or process context per request. The main idea is that we can decouple accepting and responding to web requests from the work of processing a request by storing newly accepted requests in a work queue and storing processed results in an output queue. By explicitly enumerating our outstanding tasks instead of punting them onto the OS, we are free to process these queues on whatever schedule and with whatever execution resources we want.
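
A rough sketch of the event-driven style (assuming Linux and epoll; error handling and the real request-processing queue are omitted): a single thread watches every connection at once and only does work when a socket is actually ready.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
  int listen_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
  struct sockaddr_in addr = {0};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(8080);
  bind(listen_fd, (struct sockaddr*)&addr, sizeof(addr));
  listen(listen_fd, SOMAXCONN);

  int epfd = epoll_create1(0);
  struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
  epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

  struct epoll_event events[64];
  for (;;) {
    // One thread waits on all outstanding connections at once instead of
    // one blocked thread per connection.
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++) {
      if (events[i].data.fd == listen_fd) {
        // New connection: register it and move on.
        int conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
        struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
        epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
      } else {
        // Request data is ready: a real server would push it onto a work
        // queue here; this sketch just echoes it back.
        char buf[4096];
        ssize_t r = read(events[i].data.fd, buf, sizeof(buf));
        if (r <= 0) { close(events[i].data.fd); continue; }
        write(events[i].data.fd, buf, r);
      }
    }
  }
}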

There are two main ways to abstract and implement this architecture which I'll go over another time.


I like to think of these consistency models as saying things about the timelines of read and writes of the processors on a system.

In the TSO model, all processors observe the exact same interleaving of writes to different memory locations on their timeline.

In the PC model, each processor can observe a different interleaving of writes to different memory locations on their timeline.

I say interleaving here because the order of writes by any one processor must be maintained, but the way in which that order is merged with the write orders of other processors is dictated by the consistency model.
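
A concrete way to see the difference is the classic "independent reads of independent writes" litmus test. This is only a sketch, not a rigorous test harness (the variables and thread functions are made up for illustration):

#include <pthread.h>
#include <stdio.h>

volatile int x = 0, y = 0;
int r1, r2, r3, r4;

void* write_x(void* a) { x = 1; return NULL; }
void* write_y(void* a) { y = 1; return NULL; }
void* read_xy(void* a) { r1 = x; r2 = y; return NULL; }
void* read_yx(void* a) { r3 = y; r4 = x; return NULL; }

int main(void) {
  pthread_t t[4];
  void* (*fn[4])(void*) = { write_x, write_y, read_xy, read_yx };
  for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, fn[i], NULL);
  for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
  // The outcome r1=1, r2=0, r3=1, r4=0 means the two readers disagreed on
  // the order of the writes to x and y. TSO forbids this (everyone observes
  // the same total order of writes); PC permits it.
  printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
  return 0;
}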


The only case in which this goal could be achieved is if, by adding cores, our entire working set now fits in cache, resulting in a superlinear speedup.


apoms commented on slide_020 of Directory-Based Cache Coherence ()

It's interesting to note that we can still satisfy the properties of coherence even if Processor 0 writes to the requested cache line after step 2 instead of waiting until after step 4.

Think about it this way: the cache system only needs to maintain a single, totally ordered sequence of writes to the same cache line. If we wait until the acks return, we satisfy the total-order property AND a stronger guarantee: that the history of writes is immediately propagated to all processors. This guarantee ensures that we can't have stale reads, because only one version of the data can be in any cache at one time. However, it comes at a latency cost: we must now wait until the acks are received before proceeding.

Instead, if we write to the cache line after step 2 we can avoid that extra latency at the cost of stale reads. We introduce stale reads because, if we allow the write to occur after step 2, it could occur before the invalidation requests have reached each processor. In that situation, Processor 1 or 2 could read their local copy of the now-modified cache line and see an old version of the data. Note, however, that the total order of writes is still maintained; the difference is that each processor might be seeing an older prefix of the total order of writes, but never one that conflicts with it. Another situation which might give cause for concern is if Processor 1 or 2 initiates a write to their local copy after Processor 0 has passed step 2. This is fine, though, because the directory responsible for that cache line can ensure that it does not process a write request for the same cache line until it receives word that all invalidation acks have been processed (in other words, that the previous write request has been atomically completed).


@cube: I believe Ocean Sim appears to have no cold misses because it performs a significantly higher number of memory operations per data element than the other programs. Cold misses have a fixed upper limit proportional to the size of the working set because they only occur on the first access to a memory location. As the program makes more memory requests, this fixed count becomes a vanishing fraction of all misses. You can see the same phenomenon with Barnes-Hut, where cold misses contribute very little. This makes sense for scientific computing applications because they take a fixed-size initial state and evolve it over a long time period, reading and writing the same values many times.

The reason radix sort has such a high proportion of cold misses is that it performs relatively few total memory operations per data element.


I was confused as to why one would want a write-back write-hit policy with a write-no-allocate write-miss policy, because that seemed like it would defeat the purpose of write-back buffering. However, there is a case where you still get the benefit of write-back: once you have performed a read of some cache line, any subsequent writes to that line will be buffered until the line is evicted. Write-no-allocate only applies when the cache line you are writing to is not yet in the cache. In fact, if you had a write-no-allocate cache and were planning on writing to several locations in a line, it would make sense to first read some location in that line to bring it into the cache and then perform your writes on the line. This way the writes would be buffered and you would avoid multiple direct writes to memory and the associated latency costs.
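
A tiny sketch of that trick (whether it actually helps depends on the cache's real policies; the names here are made up):

volatile char sink;

void fill_line(char* line, int n, char value) {
  sink = line[0];               // the read miss allocates the line in the cache
  for (int i = 0; i < n; i++)
    line[i] = value;            // these writes now hit and are buffered until
}                               // the dirty line is eventually evicted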


How I conceptualize these in my head:

  • Problem constrained speedup: How much faster can we get the same result?
  • Time constrained speedup: How much can we improve the result in the same time it took to get the original?
  • Memory constrained speedup: How much can we increase the rate at which we improve our result?

apoms commented on slide_013 of Parallel Programming Case Studies ()

Another benefit of the 4D blocked layout is that a hardware prefetcher might be able to predict the memory accesses and thus dramatically reduce memory latency without requiring explicit software prefetch instructions.


@kayvonf: You could design the hardware to avoid a "cliff" by using a randomized address-to-bank assignment. Since memory accesses generally follow some pattern, it's very unlikely that one would encounter the worst-case performance of completely serialized memory accesses in real programs. However, by not being able to predict how your memory accesses map to memory banks, you lose the ability to optimize your program for maximum performance. This randomization technique can thus be thought of as decreasing performance variance, which lowers the maximum and raises the minimum performance achievable in practice.


This style of "persistent threads" is particularly useful in allowing a program to take advantage of locality. For instance, let's say we have a small lookup table, which fits into shared memory, that maps some input set of values to an output set of values, and a large array of data A that we want to convert from the input set to the output set. If we took a traditional data-parallel approach, we would allocate a thread for each element of A, load the lookup table into shared memory for each block of threads, apply the mapping to the elements of A, and then save the mapped elements back out to global memory.

The issue here is that each time we launch and retire a new block of threads, we consume memory bandwidth reloading our lookup table from global memory. Instead, we could use a "persistent" block that loads the lookup table once, processes many elements of A in a loop, and stops when all elements of A have been exhausted. This reduces our memory bandwidth usage and the contention on the global memory holding the lookup table.


CUDA programs actually have two levels of abstraction: the language itself and a virtual GPU architecture. As long as the language compiles down to the same virtual GPU architecture, NVIDIA can change the CUDA language as much as they'd like without worrying that programs written in a new CUDA version will have improper semantics on an older GPU. If the CUDA language needs hardware support for a new feature, then it is easy for NVIDIA to specify a new virtual GPU architecture and only allow that feature in programs that compile to the new virtual architecture.

These two abstractions then bottom out to a single implementation of the virtual GPU architecture: a real GPU architecture. This real GPU architecture implements the virtual GPU architecture, allowing different GPU chips to have a lot of flexibility in the way the hardware is laid out without being concerned with violating some assumption made at the language level.

Additionally, compilers with knowledge of the specific real GPU the program is going to run on (for example, JIT compilers) can specialize the virtual GPU code to take advantage of the real GPU hardware characteristics in the final compiled binary.

You can find more info in the CUDA Toolkit documentation section on GPU Compilation.


apoms commented on slide_027 of Parallel Programming Basics ()

Another great example of utilizing domain knowledge to reformulate an approach to solving a problem is bump mapping.

In graphics, we approximately represent the geometry of objects in our scene using triangle meshes. However, if we want extremely high geometric detail, we need a massive amount of vertex data to represent every part of the object. One way to get around this is to stay with a simple triangle mesh but use a texture that describes the fine variation of the surface (the "bumps"). This technique is called bump mapping and allows us to achieve most of the effect we want at a much lower memory and processing cost.

As in the example above, the result is not exactly the same, but historically this enabled much more detailed scenes for a relatively minor cost.


apoms commented on slide_036 of Parallel Programming Models ()

Why has static ILP not taken hold in a GPU context? Since the companies that produce GPUs have total vertical control over their stack (hardware, runtime, compiler, language), and even JIT code at runtime, it seems like an ideal environment for implementing something like VLIW.

Both CUDA and ISPC have similar processing-element abstractions: warps for CUDA and gangs for ISPC. One of the unique side effects of these abstractions is that, due to their implementation, they afford much lower synchronization costs within a gang or warp. This has led to algorithms which are very fast under these models but completely infeasible in others (pthreads, for example, could not implement prefix scan the same way ISPC or CUDA does). In fact, if one implemented the same algorithm using pthreads it would be incorrect due to non-deterministic behavior (pthreads don't operate in lock step like gangs and warps do). Although they technically expose the same parallelism, the implementation drastically alters how one writes code for it. How can we design abstractions that are easy for the programmer to use, but also lift enough hardware detail up that the programmer can reason about it and express the kinds of information that let the compiler make better decisions?


apoms commented on slide_060 of A Modern Multi-Core Processor ()

Where does Intel's Xeon Phi fall on this spectrum? If I remember correctly, they schedule their threads using a software scheduler (and thus not single-cycle scheduling?), so I imagine they would need a coarser partitioning of work.

Additionally, Xeon Phis have a much more coherent cache hierarchy than GPUs do. How does this affect the performance of programs written for the Xeon Phi?

I was also told that NVIDIA's SIMD implementation includes a PC register per lane while Intel's does not, thus requiring the user to manually manage masks for branch divergence. Is this true and is having a PC per lane the only solution? How would that even be supported in Intel's programming model?


apoms commented on slide_038 of Why Parallelism ()

How have supercomputers responded to the power wall? Several factors off the top of my head:

  • It seems like supercomputers can vary power among individual units with much higher variance because of custom cooling. Could this be used to dynamically alter clock frequency to combat dynamic load imbalance?
  • Communication latency and bandwidth are huge problems for supercomputer workloads and also huge power hogs. What are the tradeoffs between latency, bandwidth, and power for interconnects?
  • Are there quantitative ways to schedule supercomputing programs to minimize power usage?
  • Does VLIW make sense in a supercomputing context? If so and not in use, why not? There are definite power advantages to having simpler hardware.

Somewhat off topic, but as supercomputers increase in number of nodes, and failures become more common, has the SC community started to adopt distributed systems techniques (specifically fault-tolerance)?


I am curious, after learning about both fine-grained locking and transactional memory, whether there are specific scenarios where fine-grained locking is significantly better than TM. It seems that the two are either close in performance or TM is faster (at least for most of the examples we looked at in the TM lecture). Granted, I am sure there are pathological cases where TM may be a really bad choice, but in general it seems to be an upgrade from fine-grained locking, especially when the relative complexities from the programmer's POV are considered.


scedarbaum commented on slide_014 of Parallel Computing With Spark ()

I was wondering if there was a good way to tell Spark that certain RDDs shouldn't persist in memory. This would be useful when you only want 1 or 2 RDDs to persist (it would likely decrease memory swapping and cache traffic). Turns out there is an aptly named "unpersist" function which does exactly that.


I've seen talks with 15 graphs and not a single one had labeled axes. Please don't ever do this.


I feel like the potency of anti-aliasing is contingent on color being linear. It's not obvious to me that this is true. Colors are weird. Like isn't yellow halfway between red and green?


ak47 commented on slide_018 of Parallel Computing With Spark ()

The bulk transfers seem to reduce the volume of the logging, but it also seems like they would increase the time it takes to recover from the log, since a lot more data is at stake.


Releasing locks early leads to a problem known as cascading abort (or cascading rollback), where a dirty read causes multiple transactions to roll back. Holding locks until the end of the transaction is called strict two-phase locking and avoids this.


vrazdan commented on slide_028 of Parallel Computing With Spark ()

I wonder where the threshold is at which the energy cost of the extra computation needed to be resilient in cases like these becomes larger than that of simply having two pieces of hardware operating at the same time. I would imagine that for a smaller/embedded system, having two pieces of hardware is better than one.


For timestamp allocation, it seems that even if you had dedicated hardware you would still run into the issue of the speed of light, and you would constantly have to measure your latency to the servers you're communicating with in order to grant correct time orderings.


Another issue with this is that you need to predict your workload in advance, so that you don't waste energy turning the GPU (or even just specific hardware components on the GPU) on and off again.


vrazdan commented on slide_041 of Interconnection Networks ()

What controlled the mux's select input? The switch, which would see which channels were available and select based on that?


admintio42 commented on slide_005 of Domain-Specific Programming Systems ()

How have we not mentioned SML?


admintio42 commented on slide_028 of Parallel Computing With Spark ()

Is there a default way RDDs are computed? Is it the programmer's job to specify the order, or does the system figure out the best way to do this on its own?


This lecture was pretty cool. It made me want to take graphics, but I'm graduating so I can't.


The U.S. government recently banned Intel from exporting Xeon and Xeon Phi chips to boost Tianhe-2 (http://www.extremetech.com/extreme/203195-us-government-blocks-intel-chinese-collaboration-over-nuclear-fears-national-security).

It will be interesting to see if the Chinese will try to incorporate their homegrown CPUs into the supercomputer one day.


kk commented on slide_044 of Transactional Memory ()

I think n should be the number of concurrent transactions in this case instead of processes. Also, if every transaction accesses a lot of memory addresses, then the number of different memory addresses might factor in as well.


kk commented on slide_025 of Addressing the Memory Wall ()

If our address space is byte-interleaved across DRAM chips, then it's less likely that a user's memcpy will be copying an entire row of DRAM.


The next slide explains that we want the primitives to compose well. In this case, having set union and intersection makes it easy to compose the two, and not having set difference keeps the system simpler.


If the work stolen is very small, then the greedy policy might not be ideal since the overhead of stealing that piece of work might be greater than the amount of time spent stalling and having the other thread finish the work instead.


Corian commented on slide_007 of Addressing the Memory Wall ()

Out of curiosity, is this realistic? Why is only one byte loaded at a time when the buffer is so much larger than that?


Corian commented on slide_020 of Parallel Computing With Spark ()

Is there a master node in this process? How are failures detected?


I wonder why they wait three days. Etiquette? :D


A couple slides ago it was noted that partitioning a graph is expensive and difficult. Is that still the case here? It seems like having a distributed system would easily be worth the trouble if we're going to partition anyway.


In what cases would the mesh change? And how would those cases be automatically parallelized?


In OS, we talked about 3 criteria to a lock:

  1. Mutual Exclusion
    • Only 1 thread can be in the critical section at a time.
  2. Bounded Waiting
    • Once a thread tries to gain entry into the critical section, the number of threads that it has to wait for is bounded.
  3. Progress
    • One thread will always be chosen to obtain the lock.

The ticket lock ensures bounded waiting since every thread is queued.
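
For reference, a minimal ticket lock sketch using C11 atomics (both fields start at 0; the names are illustrative). Bounded waiting falls out of the FIFO ticket order:

#include <stdatomic.h>

typedef struct {
  atomic_uint next_ticket;  // ticket handed to the next arriving thread
  atomic_uint now_serving;  // ticket currently allowed into the critical section
} ticket_lock_t;

void ticket_lock(ticket_lock_t* l) {
  // Take a unique, monotonically increasing ticket...
  unsigned me = atomic_fetch_add(&l->next_ticket, 1);
  // ...then wait to be served. Threads enter in ticket order, so the number
  // of threads ahead of any waiter is bounded.
  while (atomic_load(&l->now_serving) != me)
    ;  /* spin */
}

void ticket_unlock(ticket_lock_t* l) {
  atomic_fetch_add(&l->now_serving, 1);  // admit the next ticket holder
}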


grose commented on slide_039 of Interconnection Networks ()

But isn't that an insubstantial difference? From what I've read, packets aren't that big: http://searchnetworking.techtarget.com/answer/Minimum-and-maximum-Ethernet-frame-sizes

I guess it might make a difference on chip though...


I think it's reasonable to say that 40 years ago, few would ever have thought we might "need" to have 1GB of RAM in a single machine - and yet, now we have more than 100x that in some high end machines, and it's STILL not enough for everything we want to do.

I wouldn't rule it out.


@meatie Existing database systems are unable to take full advantage of many processors, and hence they are becoming extremely complex in design in order to achieve high throughput. I would say 1000 is a magic number here; making databases as scalable as possible is an open research topic.


I agree, these tips by Kayvon are very reasonable and impactful. They will surely be helpful throughout my life.


meatie commented on slide_038 of Course Wrap Up + Presentation Tips ()

One possible reason that today's software is inefficient is that code that exploits a particular hardware architecture is not portable. For example, MPI code that is efficient on this cluster may be inefficient on another cluster, since the communication pattern will be different, etc. If it's difficult to write efficient and portable code, software might trade this efficiency off for simplicity.


meatie commented on slide_028 of Interconnection Networks ()

It is interesting that all the network topologies shown here are 2-D. I was wondering if there are networks designed in 3-D, like a cube or hypercube.