Parallel Computer Architecture and Programming (CMU 15-418/618)

This page contains lecture slides, videos, and recommended readings for the Spring 2016 offering of 15-418/618. The full listing of lecture videos is available on the course's Panopto site.

(forms of parallelism + understanding Latency and BW)
(ways of thinking about parallel programs, and their corresponding hardware implementations)
(the thought process of parallelizing a program)
(CUDA programming abstractions, and how they are implemented on modern GPUs)
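
As a concrete illustration of the CUDA abstractions this lecture covers, here is a minimal SAXPY sketch (a hypothetical example, not taken from the course assignments; it uses unified memory to keep the code short, where explicit cudaMemcpy is more typical):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each CUDA thread computes one output element; the grid of thread blocks is
// the programmer-visible abstraction the hardware maps onto the GPU's cores.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```
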
(good work balance while minimizing the overhead of making the assignment, scheduling Cilk programs with work stealing)
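
The fork-join pattern that a Cilk-style work-stealing scheduler executes can be sketched in plain C++ with std::async standing in for cilk_spawn (illustrative only; a real Cilk runtime makes spawning far cheaper and steals continuations from per-worker deques, and the sequential cutoff value below is an arbitrary choice for granularity control):

```cpp
#include <future>
#include <cstdio>

// Fork-join parallel Fibonacci: recursive calls above the cutoff may run in
// parallel; cilk_spawn/cilk_sync would express the same dependence structure.
long fib(int n) {
    if (n < 2) return n;
    if (n < 16) return fib(n - 1) + fib(n - 2);  // sequential cutoff limits task count
    auto child = std::async(std::launch::async, fib, n - 1);  // "fork"
    long y = fib(n - 2);   // parent keeps working on the other half
    long x = child.get();  // "join": wait for the spawned child
    return x + y;
}

int main() {
    printf("fib(25) = %ld\n", fib(25));
    return 0;
}
```
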
(message passing, async vs. blocking sends/receives, pipelining, techniques to increase arithmetic intensity, avoiding contention)
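
A small MPI sketch of the blocking vs. asynchronous send distinction mentioned here (a hypothetical two-rank example, run with mpirun -np 2; the tag and buffer size are made up):

```cpp
#include <mpi.h>
#include <cstdio>

// Rank 0 sends with a non-blocking MPI_Isend so it can overlap independent
// work with communication; rank 1 uses a blocking MPI_Recv.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;
    double buf[N];

    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = i;
        MPI_Request req;
        // Asynchronous send: returns immediately; buf must not be modified
        // until the matching wait completes.
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        // ... overlap independent computation here ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        // Blocking receive: does not return until the message has arrived.
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received, buf[10] = %f\n", buf[10]);
    }

    MPI_Finalize();
    return 0;
}
```
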
(examples of optimizing parallel programs)
(hard vs. soft scaling, memory-constrained scaling, scaling problem size, tips for analyzing code performance)
(definition of memory coherence, invalidation-based coherence using MSI and MESI, maintaining coherence with multi-level caches, false sharing)
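
The false sharing problem covered at the end of this lecture is easy to reproduce; in this minimal sketch (a hypothetical example), padding each counter to its own cache line is what prevents the two threads from invalidating each other's lines:

```cpp
#include <thread>
#include <cstdio>

// Without alignas(64), the two counters would likely share a cache line, and
// every increment would invalidate the other core's copy (false sharing),
// even though the threads touch logically independent data.
struct alignas(64) PaddedCounter {
    long value = 0;
};

int main() {
    PaddedCounter counters[2];
    auto work = [&](int id) {
        for (long i = 0; i < 10000000; i++)
            counters[id].value++;   // each thread writes only its own counter
    };
    std::thread t0(work, 0), t1(work, 1);
    t0.join();
    t1.join();
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```
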
(scaling problem of snooping, implementation of directories, directory storage optimization)
(deadlock, livelock, starvation, implementation of coherence on an atomic and split-transaction bus)
(consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
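
A small C++ sketch of the acquire/release semantics discussed here (a hypothetical example): the release store on the flag guarantees that the earlier write to data is visible to the thread that performs the matching acquire load.

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

int data = 0;
std::atomic<bool> flag{false};

void producer() {
    data = 42;                                    // ordinary write
    flag.store(true, std::memory_order_release);  // release: prior writes become visible
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) // acquire: pairs with the release store
        ;                                         // spin until the flag is set
    printf("data = %d\n", data);                  // guaranteed to print 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
```
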
(scale out, load balancing, elasticity, caching)
(network properties, topology, basics of flow control)
(machine-level atomic operations, implementing locks, implementing barriers)
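
A minimal sketch of a test-and-set style spin lock built from a machine-level atomic, in the spirit of this lecture (a hypothetical example using C++ std::atomic_flag):

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Spin lock built on an atomic test-and-set: test_and_set returns the old
// value, so the loop spins until it observes the flag was previously clear.
struct SpinLock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
    void lock()   { while (locked.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { locked.clear(std::memory_order_release); }
};

SpinLock lk;
long counter = 0;

int main() {
    auto work = [] {
        for (int i = 0; i < 1000000; i++) {
            lk.lock();
            counter++;          // critical section
            lk.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    printf("counter = %ld\n", counter);  // expect 2000000
    return 0;
}
```
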
(fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
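
A sketch of the lock-free stack push/pop discussed here, using compare-and-swap (a hypothetical example; the pop shown is the textbook version, which is only safe to reclaim memory from when combined with hazard pointers or another scheme, and is where the ABA problem can bite):

```cpp
#include <atomic>
#include <cstdio>

// Treiber-style lock-free stack: push and pop retry a compare-and-swap on the
// head pointer until it succeeds, so no thread ever blocks holding a lock.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

void push(int v) {
    Node* n = new Node{v, nullptr};
    n->next = head.load();
    // On failure, compare_exchange_weak reloads the current head into n->next.
    while (!head.compare_exchange_weak(n->next, n)) { }
}

bool pop(int* out) {
    Node* old = head.load();
    while (old && !head.compare_exchange_weak(old, old->next)) { }
    if (!old) return false;
    *out = old->value;
    // NOTE: freeing here is only safe with hazard pointers or similar
    // reclamation; naive reuse of freed nodes is the classic ABA scenario.
    delete old;
    return true;
}

int main() {
    push(1); push(2); push(3);
    int v;
    while (pop(&v)) printf("%d\n", v);  // prints 3 2 1
    return 0;
}
```
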
(motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
(energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, what's in a modern SoC)
(GraphLab abstractions, GraphLab implementation, streaming graph processing, graph compression)
(producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
(how DRAM works, cache compression, DRAM compression, upcoming memory technologies)
(supercomputing vs. distributed computing/analytics, design philosophy of both systems)
(intro to deep networks, what convolution does, mapping convolution to matrix multiplication, deep network compression)
(basics of gradient descent and backpropagation, memory footprint issues, asynchronous parallel implementations of gradient descent)
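
A tiny sketch of the gradient descent basics mentioned here (a hypothetical example: fitting y = w*x by minimizing squared error; the parallel and asynchronous variants the lecture discusses shard this gradient computation across workers):

```cpp
#include <cstdio>

int main() {
    // Toy data generated from y = 3x; gradient descent should recover w ~= 3.
    const int n = 4;
    float xs[n] = {1, 2, 3, 4};
    float ys[n] = {3, 6, 9, 12};

    float w = 0.0f;
    float lr = 0.01f;  // learning rate (step size)

    for (int step = 0; step < 1000; step++) {
        // Gradient of the mean squared error (1/n) * sum (w*x - y)^2 w.r.t. w
        float grad = 0.0f;
        for (int i = 0; i < n; i++)
            grad += 2.0f * (w * xs[i] - ys[i]) * xs[i] / n;
        w -= lr * grad;  // step opposite the gradient
    }
    printf("learned w = %f (expected ~3)\n", w);
    return 0;
}
```
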
(parallel rasterization, Z/color-buffer compression, tiled rendering, sort-everywhere parallel rendering)
(tips for giving a clear talk, a bit of philosophy)
(the students explore high-performance and high-efficiency topics of their choosing)