Parallel Computer Architecture and Programming (CMU 15-418/618)
This page contains lecture slides, videos, and recommended readings for the Spring 2017 offering of 15-418/618. The full listing of lecture videos is available here.
(motivations for parallel chip decisions, challenges of parallelizing code)
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge, IEEE Computer 2001
(forms of parallelism + understanding Latency and BW)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GP100 Pascal Whitepaper. NVIDIA Technical Report 2016
(ways of thinking about parallel programs, and their corresponding hardware implementations)
Further Reading: (some fun systems)
(the thought process of parallelizing a program)
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. D. Blythe (Proceedings of the IEEE, 2008), a nice overview of GPU history.
- NVIDIA GeForce GTX 1080 Whitepaper. NVIDIA Technical Report 2016
- NVIDIA Tesla P100 Whitepaper. NVIDIA Technical Report 2016
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
- Pascal Tuning Guide. NVIDIA CUDA Documentation
(achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al. PLDI 1998
- Intel Threading Building Blocks
(message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)
(examples of optimizing parallel programs)
(hard vs. soft scaling, memory-constrained scaling, scaling problem size, tips for analyzing code performance)
Further Reading:
- Roofline: An Insightful Visual Performance Model for Multicore Architectures. by S. Williams, A. Waterman, D. Patterson. CACM 2009
- NVProf Documentation
- Intel vTune
(definition of memory coherence, invalidation-based coherence using MSI and MESI, maintaining coherence with multi-level caches, false sharing)
(scaling problem of snooping, implementation of directories, directory storage optimization)
Further Reading:
- Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. A. Gupta et al. International Conference on Parallel Processing, 1990
- Knights Landing: 2nd Generation Intel Xeon Phi Processor. A. Sodani, Hot Chips 27, 2015
(deadlock, livelock, starvation, implementation of coherence on an atomic and split-transaction bus)
(consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
(scale out, load balancing, elasticity, caching)
Further Reading:
- www.highscalability.com. A cool site with many case studies (see "All Time Favorites" section)
- James Hamilton's Blog
(machine-level atomic operations, implementing locks, implementing barriers)
(fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
Further Reading:
- A Pragmatic Implementation of Non-Blocking Linked-Lists. by T. Harris, 2001
- Lock-Free Linked Lists and Skip Lists. by M. Fomitchev and E. Ruppert, 2004
- Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. by M. Michael, IEEE Trans on Parallel and Distributed Systems, 2004
- Lock-Free Data Structures with Hazard pointers. by A. Alexandrescu and M. Michael, Dr. Dobbs, 2004
(motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
Further Reading:
- Intel Optimization Manual, Chapter 12
- Intel Restricted Transactional Memory Overview
- Students may find it helpful to read the background chapters of Austen McDonald's Thesis
(energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, what's in a modern SoC)
Further Reading:
- NVIDIA Tegra X1 Whitepaper
- Amdahl's Law in the Multicore Era. by M. Hill and M. Marty. IEEE Computer 2008
- Understanding Sources of Inefficiency in General-Purpose Chips. by Hameed et al. ISCA 2010
- EIE: Efficient Inference Engine on Compressed Deep Neural Network. by Han et al. ISCA 2016
- Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications. (Hot Chips 25, 2013)
(motivation for DSLs, case studies on Liszt and Halide)
Further Reading:
- Liszt: A DSL for solving mesh-based PDEs
- Halide: a language for image processing and computational photography. J. Ragan-Kelley et al. [2012, 2014]
- Automatically Scheduling Halide Image Processing Pipelines. Mullapudi et al. 2016
- Rigel: Flexible Multi-Rate Image Processing Hardware. Hegarty et al. SIGGRAPH 2016 (code)
- Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging. Z. DeVito et al. 2016
- Ebb: A DSL for Physical Simulation on CPUs and GPUs. G. Bernstein et al. 2016
- Simit: A Language for Physical Simulation. Kjolstad et al. 2015
- Why New Programming Languages for Simulation?. G. Bernstein and F. Kjolstad, TOG 2016
(GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression)
Further Reading:
- Ligra Web Site (see bottom of this page for Ligra papers)
- GraphChi: Large-Scale Graph Computation on Just a PC. by Kyrola et al. OSDI 2012
- GraphLab Documentation (now see Turi's site)
- Apache Spark's GraphX Framework
(producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
Further Reading:
- Apache Spark Web Site
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. by Zaharia et al. NSDI 2012
- Scalability! But at What COST?. F. McSherry et al. HotOS 2015
- Accelerating Spark Workloads Using GPUs. by R. Bordawekar, 2016
- Weld: A Common Runtime for High Performance Data Analytics, by S. Palkar et al., CIDR 2017
- Deep Dive into Project Tungsten: Bringing Apache Spark Closer to Bare Metal. J. Rosen 2015
(intro to deep networks, what a convolution does, mapping convolution to matrix multiplication, deep network compression)
Further Reading:
- Stanford cs231n: Convolutional Neural Networks for Visual Recognition. I recommend that you read through the lecture notes for modules 1 and 2 for a very nice explanation of key topics.
- Neural Networks and Deep Learning, Nielsen, 2016 (a free online book)
- Check out the TensorFlow tutorials and play around in the TensorFlow Playground
- Visualizing and Understanding Convolutional Neural Networks, Zeiler and Fergus, ECCV14
- ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky et al. NIPS 2012 (this is the original "AlexNet" paper)
- Going Deeper with Convolutions, Szegedy et al. CVPR 2015 (this is the original Google Inception paper)
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, Iandola et al. 2016
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Han et al. ICLR 2016
- EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al. ISCA 2016
- In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi et al. ISCA 2017
(basics of gradient descent and backpropagation, memory footprint issues, asynchronous parallel implementations of gradient descent)
Further Reading:
- Scaling Distributed Machine Learning with the Parameter Server, Li et al. OSDI 2014
- Project Adam: Building an Efficient and Scalable Deep Learning Training System, Chilimbi et al. OSDI 2014
- FireCaffe: Near-linear Acceleration of Deep Neural Network Training on Compute Clusters, Iandola et al. CVPR 2016
(how DRAM works, cache compression, DRAM compression, upcoming memory technologies like 3D stacking)
Further Reading:
- Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. by Pekhimenko et al. PACT 2012
- AnandTech Article on HBM2
(supercomputing vs. distributed computing/analytics, design philosophy of both systems)
(Guest lecture by Andy Pavlo)
(concurrency control in databases, transactions, two-phase locking, timestamp ordering)
Further Reading:
- Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. Yu et al. VLDB Endowment 2014
(tips for giving a clear talk, a bit of philosophy)
Further Reading:
- Kayvon's Clear Talk Tips (expanded version)
- Do Grades Matter (a discussion about thinking beyond classes at CMU)