Parallel Computer Architecture and Programming (CMU 15-418/618)
This page contains lecture slides, videos, and recommended readings for the Spring 2017 offering of 15-418/618. The full listing of lecture videos is available here.
(motivations for parallel chip decisions, challenges of parallelizing code)
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge, IEEE Computer 2001
(forms of parallelism + understanding Latency and BW)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GP100 Pascal Whitepaper. NVIDIA Technical Report 2016
(ways of thinking about parallel programs, and their corresponding hardware implementations)
Further Reading: (some fun systems)
(the thought process of parallelizing a program)
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. D. Blythe (Proceedings of the IEEE, 2008), a nice overview of GPU history.
- NVIDIA GeForce GTX 1080 Whitepaper. NVIDIA Technical Report 2016
- NVIDIA Tesla P100 Whitepaper. NVIDIA Technical Report 2016
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
- Pascal Tuning Guide. NVIDIA CUDA Documentation
(achieving good work distribution while minimizing overhead, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al. PLDI 1998
- Intel Threading Building Blocks
(message passing, async vs. blocking sends/receives, pipelining, increasing arithmetic intensity, avoiding contention)
(examples of optimizing parallel programs)
(hard vs. soft scaling, memory-constrained scaling, scaling problem size, tips for analyzing code performance)
Further Reading:
- Roofline: An Insightful Visual Performance Model for Multicore Architectures. by S. Williams, A. Waterman, D. Patterson. CACM 2009
- NVProf Documentation
- Intel vTune
(definition of memory coherence, invalidation-based coherence using MSI and MESI, maintaining coherence with multi-level caches, false sharing)
(scaling problem of snooping, implementation of directories, directory storage optimization)
Further Reading:
- Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes. A. Gupta et al. International Conference on Parallel Processing, 1990
- Knights Landing: 2nd Generation Intel Xeon Phi Processor. A. Sodani, Hot Chips 27, 2015
(deadlock, livelock, starvation, implementation of coherence on an atomic and split-transaction bus)
(consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
(scale out, load balancing, elasticity, caching)
Further Reading:
- www.highscalability.com. A cool site with many case studies (see "All Time Favorites" section)
- James Hamilton's Blog
(machine-level atomic operations, implementing locks, implementing barriers)
(fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
Further Reading:
- A Pragmatic Implementation of Non-Blocking Linked-Lists. by T. Harris, 2001
- Lock-Free Linked Lists and Skip Lists. by M. Fomitchev and E. Ruppert, 2004
- Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. by M. Michael, IEEE Trans on Parallel and Distributed Systems, 2004
- Lock-Free Data Structures with Hazard pointers. by A. Alexandrescu and M. Michael, Dr. Dobbs, 2004
(motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
Further Reading:
- Intel Optimization Manual, Chapter 12
- Intel Restricted Transactional Memory Overview
- Students may find it helpful to read the background chapters of Austen McDonald's Thesis
(energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, what's in a modern SoC)
Further Reading:
- NVIDIA Tegra X1 Whitepaper
- Amdahl's Law in the Multicore Era. by M. Hill and M. Marty. IEEE Computer 2008
- Understanding Sources of Inefficiency in General-Purpose Chips. by Hameed et al. ISCA 2010
- EIE: Efficient Inference Engine on Compressed Deep Neural Network. by Han et al. ISCA 2016
- Qualcomm Hexagon DSP: An architecture optimized for mobile multimedia and communications. (Hot Chips 25, 2013)
(motivation for DSLs, case studies on Liszt and Halide)
Further Reading:
- Liszt: A DSL for solving mesh-based PDEs
- Halide: a language for image processing and computational photography. J. Ragan-Kelley et al. [2012, 2014]
- Automatically Scheduling Halide Image Processing Pipelines. Mullapudi et al. 2016
- Rigel: Flexible Multi-Rate Image Processing Hardware. Hegarty et al. SIGGRAPH 2016 (code)
- Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging. Z. DeVito et al. 2016
- Ebb: A DSL for Physical Simulation on CPUs and GPUs. G. Bernstein et al. 2016
- Simit: A Language for Physical Simulation. Kjolstad et al. 2015
- Why New Programming Languages for Simulation?. G. Bernstein and F. Kjolstad, TOG 2016
(GraphLab, Ligra, and GraphChi, streaming graph processing, graph compression)
Further Reading:
- Ligra Web Site (see bottom of this page for Ligra papers)
- GraphChi: Large-Scale Graph Computation on Just a PC. by Kyrola et al. OSDI 2012
- GraphLab Documentation (now see Turi's site)
- Apache Spark's GraphX Framework
(producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
Further Reading:
- Apache Spark Web Site
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. by Zaharia et al. NSDI 2012
- Scalability! But at What COST?. F. McSherry et al. HotOS 2015
- Accelerating Spark Workloads Using GPUs. by R. Bordawekar, 2016
- Weld: A Common Runtime for High Performance Data Analytics, by S. Palkar et al., CIDR 2017
- Deep Dive into Project Tungsten: Bringing Apache Spark Closer to Bare Metal. J. Rosen 2015
(intro to deep networks, what a convolution does, mapping convolution to matrix multiplication, deep network compression)
Further Reading:
- Stanford cs231n: Convolutional Neural Networks for Visual Recognition. I recommend that you read through the lecture notes for modules 1 and 2 for a very nice explanation of key topics.
- Neural Networks and Deep Learning, Nielsen, 2016 (a free online book)
- Check out the TensorFlow tutorials and play around in the TensorFlow Playground
- Visualizing and Understanding Convolutional Neural Networks, Zeiler and Fergus, ECCV14
- ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky et al. NIPS 2012 (this is the original "AlexNet" paper)
- Going Deeper with Convolutions, Szegedy et al. CVPR 2015 (this is the original Google Inception paper)
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, Iandola et al. 2016
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Han et al. ICLR 2016
- EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al. ISCA 2016
- In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi et al. ISCA 2017
(basics of gradient descent and backpropagation, memory footprint issues, asynchronous parallel implementations of gradient descent)
Further Reading:
- Scaling Distributed Machine Learning with the Parameter Server, Li et al. OSDI 2014
- Project Adam: Building an Efficient and Scalable Deep Learning Training System, Chilimbi et al. OSDI 2014
- FireCaffe: Near-linear Acceleration of Deep Neural Network Training on Compute Clusters, Iandola et al. CVPR 2016
(how DRAM works, cache compression, DRAM compression, upcoming memory technologies like 3D stacking)
Further Reading:
- Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. by Pekhimenko et al. PACT 2012
- AnandTech Article on HBM2
(supercomputing vs. distributed computing/analytics, design philosophy of both systems)
(Guest lecture by Andy Pavlo)
(concurrency control in databases, transactions, two-phase locking, timestamp ordering)
Further Reading:
- Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. Yu et al. VLDB Endowment 2014
(tips for giving a clear talk, a bit of philosophy)
Further Reading:
- Kayvon's Clear Talk Tips (expanded version)
- Do Grades Matter (a discussion about thinking beyond classes at CMU)