Parallel Computer Architecture and Programming

From smartphones to multi-core CPUs and GPUs to the world's largest supercomputers and websites, parallel processing is ubiquitous in modern computing. The goal of this course is to provide a deep understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to use these machines effectively. Because writing good parallel programs requires an understanding of key machine performance characteristics, this course covers both parallel hardware and software design.

This course was held at Tsinghua University in Summer 2017. The CMU version of this course is 15-418/618.

Course Information
Instructors: Kayvon Fatahalian and Wei Xue
See the course information page for more info on policies and logistics.
Here is a listing of all lecture times and assignment deadlines.
Tsinghua Summer 2017 Lecture Schedule
Jun 27: motivations for parallel chip decisions, challenges of parallelizing code
Jun 28: forms of parallelism, understanding latency and bandwidth
Jun 28: abstraction vs. implementation, how the SPMD programming model maps to SIMD hardware
Jun 30: the thought process of parallelizing a program, parallel programming models, Amdahl's law
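As a quick reference for the Amdahl's law topic in this lecture (stated in the standard textbook form, with symbols chosen for this page rather than taken from the slides): if a fraction s of a program's execution is inherently serial, then on p processors

    \[
        \text{speedup}(p) \;\le\; \frac{1}{\,s + \frac{1 - s}{p}\,},
        \qquad
        \lim_{p \to \infty} \text{speedup}(p) \;=\; \frac{1}{s}.
    \]

For example, if 10% of the work is serial (s = 0.1), no number of processors can deliver more than a 10x speedup.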
Jul 3: achieving good work distribution while minimizing overhead, scheduling Cilk programs, work stealing
Jul 5: analyzing parallel algorithms via work and span
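For reference on the work/span topic in this lecture (standard definitions, not quoted from the course materials): let T_1 be the work (running time on one processor) and T_inf the span (length of the critical path). Then the running time on p processors satisfies

    \[
        T_p \;\ge\; \max\!\left(\frac{T_1}{p},\; T_\infty\right),
        \qquad
        T_p \;\le\; \frac{T_1}{p} + T_\infty \ \ \text{(greedy scheduling)},
        \qquad
        \text{parallelism} \;=\; \frac{T_1}{T_\infty},
    \]

so achievable speedup is limited both by the processor count and by the ratio of work to span.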
Jul 5: CUDA programming abstractions, and how they are implemented on modern GPUs
Jul 6: message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention
Jul 10: definition of memory coherence, invalidation-based coherence via MSI/MESI, snooping and directory schemes, false sharing
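A minimal C++ sketch of the false-sharing topic from this lecture (illustrative code written for this page, not taken from the course materials): two threads update independent counters, but when the counters land on the same cache line, an invalidation-based protocol bounces the line between cores; padding each counter to its own line removes the effect.

    // false_sharing_sketch.cpp -- hypothetical demo; compile with: g++ -O2 -std=c++17 -pthread
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    constexpr int kIters = 50000000;

    struct Unpadded {                        // both counters likely share one 64-byte cache line
        std::atomic<long> a{0}, b{0};
    };
    struct Padded {                          // each counter gets its own cache line
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    void time_it(const char* label) {
        Counters c;
        auto start = std::chrono::steady_clock::now();
        std::thread t1([&] { for (int i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join(); t2.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %ld ms\n", label, (long)ms);
    }

    int main() {
        time_it<Unpadded>("counters on a shared cache line");
        time_it<Padded>("counters padded to separate lines");
        return 0;
    }

On typical multi-core hardware the padded version runs several times faster, even though the two threads never touch the same variable.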
Jul 12: deadlock, livelock, implementation of coherence on an atomic and split-transaction bus
Jul 12: consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics
Jul 13: fine-grained synchronization via locks; lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers
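A sketch of the lock-free stack idea from this lecture (written for this page, not the course's reference code): a Treiber-style stack in C++ where push and pop retry a compare-and-swap on the head pointer. Nodes are deliberately leaked rather than freed, because safe reclamation is exactly where the ABA problem and hazard pointers come in.

    // lockfree_stack_sketch.cpp -- hypothetical example, C++17
    #include <atomic>
    #include <optional>
    #include <utility>

    template <typename T>
    class LockFreeStack {
        struct Node { T value; Node* next; };
        std::atomic<Node*> head{nullptr};
    public:
        void push(T v) {
            Node* n = new Node{std::move(v), head.load()};
            // CAS loop: on failure, compare_exchange_weak reloads head into n->next.
            while (!head.compare_exchange_weak(n->next, n)) { }
        }
        std::optional<T> pop() {
            Node* old = head.load();
            // If another thread freed and reused 'old' between our load and the CAS,
            // a plain CAS could succeed on a stale node -- the classic ABA problem.
            // Leaking nodes (or using hazard pointers) sidesteps it in this sketch.
            while (old != nullptr && !head.compare_exchange_weak(old, old->next)) { }
            if (old == nullptr) return std::nullopt;
            return std::move(old->value);    // node intentionally not deleted here
        }
    };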
Jul 17: energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, modern SoCs
Jul 19: motivation for DSLs, case studies on Liszt and Halide
Jul 20: GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression
Jul 20: producer-consumer locality, the RDD abstraction, Spark implementation and scheduling
Jul 24: intro to DNNs, what a convolution does, mapping convolution to matrix multiplication, deep network compression, parameter- and FLOP-efficient DNN topologies
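An illustrative C++ sketch of the convolution-to-matrix-multiplication mapping named in this lecture (single channel, stride 1, no padding; the function names and data layout are this page's own, not the lecture's):

    // im2col_sketch.cpp -- lower a KxK convolution over an HxW image to a matmul
    #include <cstddef>
    #include <vector>

    // Copy every KxK patch of the input into one column of a (K*K) x (outH*outW) matrix.
    std::vector<float> im2col(const std::vector<float>& in, int H, int W, int K) {
        int outH = H - K + 1, outW = W - K + 1;
        std::vector<float> cols(static_cast<size_t>(K) * K * outH * outW);
        for (int y = 0; y < outH; ++y)
            for (int x = 0; x < outW; ++x)
                for (int ky = 0; ky < K; ++ky)
                    for (int kx = 0; kx < K; ++kx) {
                        int row = ky * K + kx;        // position inside the patch
                        int col = y * outW + x;       // which output pixel
                        cols[static_cast<size_t>(row) * outH * outW + col] =
                            in[(y + ky) * W + (x + kx)];
                    }
        return cols;
    }

    // The convolution itself is then a (1 x K*K) by (K*K x outH*outW) matrix product,
    // which is the shape a tuned GEMM (or many filters at once) can exploit.
    std::vector<float> conv_as_matmul(const std::vector<float>& filter,   // K*K weights
                                      const std::vector<float>& cols,
                                      int K, int numPixels) {
        std::vector<float> out(numPixels, 0.0f);
        for (int p = 0; p < numPixels; ++p)
            for (int r = 0; r < K * K; ++r)
                out[p] += filter[r] * cols[static_cast<size_t>(r) * numPixels + p];
        return out;
    }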
Jul 24: basics of gradient descent and backpropagation, memory footprint issues, synchronous and asynchronous parallel implementations of gradient descent, parameter server
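For reference on the synchronous data-parallel case named in this lecture (standard textbook form, using this page's own symbols): with p workers each holding a shard of the data, step t updates the weights w with learning rate eta by averaging the per-worker gradients,

    \[
        w_{t+1} \;=\; w_t \;-\; \eta \cdot \frac{1}{p} \sum_{i=1}^{p} \nabla L_i(w_t).
    \]

The asynchronous variants and the parameter server covered in the lecture relax the requirement that all p gradients be computed against the same w_t.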
Jul 26: how DRAM works, cache compression, DRAM compression, 3D stacking
Jul 27: scale-out thinking, load balancing, elasticity, caching
Jul 27: course summary, post-course opportunities, undergraduate research