Parallel Computer Architecture and Programming

From smartphones to multi-core CPUs and GPUs to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

This course was held at Tsinghua University in Summer 2017. The CMU version of this course is 15-418/618.

Course Information
Instructors: Kayvon Fatahalian and Wei Xue
See the course information page for more info on policies and logistics.
Here is a listing of all lecture times and assignment deadlines.
Tsinghua Summer 2017 Lecture Schedule
Jun 27: motivations for parallel chip decisions, challenges of parallelizing code
Jun 28: forms of parallelism, understanding latency and bandwidth
Jun 28: abstraction vs. implementation, how the SPMD programming model maps to SIMD hardware
Jun 30: the thought process of parallelizing a program, parallel programming models, Amdahl's law
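As a quick reference for the Amdahl's law topic in this lecture (stated in the standard textbook form, with symbols chosen for this page rather than taken from the slides): if a fraction s of a program's execution is inherently serial, then on p processors

    \[
        \text{speedup}(p) \;\le\; \frac{1}{\,s + \frac{1 - s}{p}\,},
        \qquad
        \lim_{p \to \infty} \text{speedup}(p) \;=\; \frac{1}{s}.
    \]

For example, if 10% of the work is serial (s = 0.1), no number of processors can deliver more than a 10x speedup.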
Jul 3: achieving good work distribution while minimizing overhead, scheduling Cilk programs, work stealing
Jul 5: analyzing parallel algorithms via work and span
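For reference on the work/span topic in this lecture (standard definitions, not quoted from the course materials): let T_1 be the work (running time on one processor) and T_inf the span (length of the critical path). Then the running time on p processors satisfies

    \[
        T_p \;\ge\; \max\!\left(\frac{T_1}{p},\; T_\infty\right),
        \qquad
        T_p \;\le\; \frac{T_1}{p} + T_\infty \ \ \text{(greedy scheduling)},
        \qquad
        \text{parallelism} \;=\; \frac{T_1}{T_\infty},
    \]

so achievable speedup is limited both by the processor count and by the ratio of work to span.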
Jul 5: CUDA programming abstractions, and how they are implemented on modern GPUs
Jul 6: message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Jul 10: definition of memory coherence, invalidation-based coherence via MSI/MESI, snooping and directory schemes, false sharing
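A minimal C++ sketch of the false-sharing topic from this lecture (illustrative code written for this page, not taken from the course materials): two threads update independent counters, but when the counters land on the same cache line, an invalidation-based protocol bounces the line between cores; padding each counter to its own line removes the effect.

    // false_sharing_sketch.cpp -- hypothetical demo; compile with: g++ -O2 -std=c++17 -pthread
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    constexpr int kIters = 50000000;

    struct Unpadded {                        // both counters likely share one 64-byte cache line
        std::atomic<long> a{0}, b{0};
    };
    struct Padded {                          // each counter gets its own cache line
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    void time_it(const char* label) {
        Counters c;
        auto start = std::chrono::steady_clock::now();
        std::thread t1([&] { for (int i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join(); t2.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %ld ms\n", label, (long)ms);
    }

    int main() {
        time_it<Unpadded>("counters on a shared cache line");
        time_it<Padded>("counters padded to separate lines");
        return 0;
    }

On typical multi-core hardware the padded version runs several times faster, even though the two threads never touch the same variable.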
Jul 12: deadlock, livelock, implementation of coherence on an atomic and split-transaction bus
Jul 12: consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Jul 13: fine-grained synchronization via locks; lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
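A sketch of the lock-free stack idea from this lecture (written for this page, not the course's reference code): a Treiber-style stack in C++ where push and pop retry a compare-and-swap on the head pointer. Nodes are deliberately leaked rather than freed, because safe reclamation is exactly where the ABA problem and hazard pointers come in.

    // lockfree_stack_sketch.cpp -- hypothetical example, C++17
    #include <atomic>
    #include <optional>
    #include <utility>

    template <typename T>
    class LockFreeStack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head{nullptr};
    public:
        void push(T v) {
            Node* n = new Node{std::move(v), head.load()};
            // CAS loop: on failure, compare_exchange_weak reloads head into n->next.
            while (!head.compare_exchange_weak(n->next, n)) { }
        }
        std::optional<T> pop() {
            Node* old = head.load();
            // If another thread freed and reused 'old' between our load and the CAS,
            // a plain CAS could succeed on a stale node -- the classic ABA problem.
            // Leaking nodes (or using hazard pointers) sidesteps it in this sketch.
            while (old != nullptr && !head.compare_exchange_weak(old, old->next)) { }
            if (old == nullptr) return std::nullopt;
            return std::move(old->value);    // node intentionally not deleted here
        }
    };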
Jul 17: energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, modern SoCs
Jul 19: motivation for DSLs, case studies on Liszt and Halide
Jul 20: GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Jul 20: producer-consumer locality, the RDD abstraction, Spark implementation and scheduling
Jul 24: intro to DNNs, what a convolution does, mapping convolution to matrix multiplication, deep network compression, parameter- and FLOP-efficient DNN topologies
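An illustrative C++ sketch of the convolution-to-matrix-multiplication mapping named in this lecture (single channel, stride 1, no padding; the function names and data layout are this page's own, not the lecture's):

    // im2col_sketch.cpp -- lower a KxK convolution over an HxW image to a matmul
    #include <cstddef>
    #include <vector>

    // Copy every KxK patch of the input into one column of a (K*K) x (outH*outW) matrix.
    std::vector<float> im2col(const std::vector<float>& in, int H, int W, int K) {
        int outH = H - K + 1, outW = W - K + 1;
        std::vector<float> cols(static_cast<size_t>(K) * K * outH * outW);
        for (int y = 0; y < outH; ++y)
            for (int x = 0; x < outW; ++x)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx) {
                        int row = ky * K + kx;        // position inside the patch
                        int col = y * outW + x;       // which output pixel
                        cols[static_cast<size_t>(row) * outH * outW + col] =
                            in[(y + ky) * W + (x + kx)];
                    }
        return cols;
    }

    // The convolution itself is then a (1 x K*K) by (K*K x outH*outW) matrix product,
    // which is the shape a tuned GEMM (or many filters at once) can exploit.
    std::vector<float> conv_as_matmul(const std::vector<float>& filter,   // K*K weights
                                      const std::vector<float>& cols,
                                      int K, int numPixels) {
        std::vector<float> out(numPixels, 0.0f);
        for (int p = 0; p < numPixels; ++p)
            for (int r = 0; r < K * K; ++r)
                out[p] += filter[r] * cols[static_cast<size_t>(r) * numPixels + p];
        return out;
    }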
Jul 24: basics of gradient descent and backpropagation, memory footprint issues, synchronous and asynchronous parallel implementations of gradient descent, parameter server
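For reference on the synchronous data-parallel case named in this lecture (standard textbook form, using this page's own symbols): with p workers each holding a shard of the data, step t updates the weights w with learning rate eta by averaging the per-worker gradients,

    \[
        w_{t+1} \;=\; w_t \;-\; \eta \cdot \frac{1}{p} \sum_{i=1}^{p} \nabla L_i(w_t).
    \]

The asynchronous variants and the parameter server covered in the lecture relax the requirement that all p gradients be computed against the same w_t.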
Jul 26: how DRAM works, cache compression, DRAM compression, 3D stacking
Jul 27: scale-out thinking, load balancing, elasticity, caching
Jul 27: course summary, post-course opportunities, undergraduate research