Parallel Computer Architecture and Programming (CMU 15-418/618)
This page contains lecture slides, videos, and recommended readings for the Spring 2016 offering of 15-418/618. The full listing of lecture videos is available on the course's Panopto site.
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge, IEEE Computer 2001
(forms of parallelism + understanding Latency and BW)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GeForce GTX 980 Whitepaper. NVIDIA Technical Report 2014
(ways of thinking about parallel programs, and their corresponding hardware implementations)
(the thought process of parallelizing a program)
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. D. Blythe, Proceedings of the IEEE, 2008 (a nice overview of GPU history)
- NVIDIA GeForce GTX 980 Whitepaper. NVIDIA Technical Report 2014
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
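The CUDA abstractions discussed above map a grid of thread blocks onto the data, with each logical thread computing one element. As a rough mental model (plain C++, not actual CUDA; the function names here are illustrative stand-ins for a `__global__` kernel and its launch), a SAXPY launch can be pictured as a loop over the (block, thread) grid:

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for a CUDA kernel body: one logical "thread"
// computes one element. In real CUDA, blockIdx/threadIdx supply i.
void saxpy_element(int i, float a, const std::vector<float>& x,
                   std::vector<float>& y) {
    y[i] = a * x[i] + y[i];
}

// Stand-in for a kernel launch: iterate over the (block, thread) grid.
// A real GPU runs these iterations in parallel across its cores.
void launch_saxpy(float a, const std::vector<float>& x, std::vector<float>& y,
                  int threads_per_block = 256) {
    int n = static_cast<int>(x.size());
    int num_blocks = (n + threads_per_block - 1) / threads_per_block;  // ceil
    for (int b = 0; b < num_blocks; ++b)
        for (int t = 0; t < threads_per_block; ++t) {
            int i = b * threads_per_block + t;  // global thread index
            if (i < n)                          // bounds guard, as in a real kernel
                saxpy_element(i, a, x, y);
        }
}
```

Note how the grid may over-provision threads (the last block is partially full), which is why the bounds guard appears in essentially every CUDA kernel.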
(achieving good work balance while minimizing the overhead of making the assignment, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al., PLDI 1998
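The spawn/sync style of Cilk programs can be sketched in plain C++ using `std::async` (an illustration only: `std::async` creates OS-level tasks rather than using a work-stealing scheduler as Cilk's runtime does):

```cpp
#include <cassert>
#include <cstddef>
#include <future>

// Divide-and-conquer sum in the Cilk spawn/sync style.
// cilk_spawn ~ std::async; cilk_sync ~ future::get().
long long parallel_sum(const long long* a, std::size_t n) {
    if (n <= 1024) {                       // serial base case
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    std::size_t half = n / 2;
    // "Spawn" the left half; the continuation computes the right half.
    auto left = std::async(std::launch::async, parallel_sum, a, half);
    long long right = parallel_sum(a + half, n - half);
    return left.get() + right;             // "sync": join the spawned child
}
```

Cilk's lazy work-stealing scheduler makes spawns nearly free when no thief steals them, which is what keeps the overhead of creating this much fine-grained parallelism acceptable.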
(message passing, async vs. blocking sends/receives, pipelining, techniques to increase arithmetic intensity, avoiding contention)
(examples of optimizing parallel programs)
(hard vs. soft scaling, memory-constrained scaling, scaling problem size, tips for analyzing code performance)
(definition of memory coherence, invalidation-based coherence using MSI and MESI, maintaining coherence with multi-level caches, false sharing)
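False sharing arises when logically independent data shared by no one still lands on the same cache line, so coherence traffic ping-pongs the line between cores. A common fix, sketched below (assuming 64-byte cache lines, which is typical but not universal), is to pad per-thread counters out to a full line:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Pad each per-thread counter so adjacent counters occupy distinct
// cache lines (assumes 64-byte lines; adjust for other machines).
struct PaddedCounter {
    std::atomic<long> value{0};
    char pad[64 - sizeof(std::atomic<long>)];
};

// Each thread increments only its own counter; with padding, no two
// threads' writes contend for the same cache line.
long count_with_padding(int nthreads, long iters) {
    std::vector<PaddedCounter> counters(nthreads);
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

The unpadded version computes the same totals, just slower: false sharing is a performance bug, not a correctness bug, which is why it is easy to miss.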
(scaling problem of snooping, implementation of directories, directory storage optimization)
(deadlock, livelock, starvation, implementation of coherence on an atomic and split-transaction bus)
(consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
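Acquire/release semantics can be illustrated with the classic flag-based message-passing pattern in C++ atomics: the store-release "publishes" the data, and any load-acquire that observes the flag is guaranteed to also see the data written before the release.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                   // plain, non-atomic payload
std::atomic<bool> ready{false}; // synchronization flag

void producer() {
    data = 42;                                    // ordinary write
    ready.store(true, std::memory_order_release); // publish: no earlier
                                                  // write may move below this
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return data;  // acquire pairs with the release: guaranteed to see 42
}
```

Under a fully relaxed model, the write to `data` could become visible after the flag; the acquire/release pair is exactly the ordering constraint needed here, and nothing more, which is the motivation for relaxed consistency models.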
(scale out, load balancing, elasticity, caching)
Further Reading:
- www.highscalability.com. A cool site with many case studies (see "All Time Favorites" section)
- James Hamilton's Blog
(network properties, topology, basics of flow control)
(machine-level atomic operations, implementing locks, implementing barriers)
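A test-and-set spinlock, one of the lock implementations built from a single atomic machine operation, can be sketched in C++ as follows:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set spinlock: lock() spins until the atomic exchange
// observes the flag clear; unlock() clears it with release semantics.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Hypothetical driver: several threads increment a shared counter
// under the lock; the final count shows mutual exclusion held.
long run_counter(int nthreads, long iters) {
    SpinLock lk;
    long counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&lk, &counter, iters] {
            for (long i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;
                lk.unlock();
            }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```

This naive version generates heavy coherence traffic under contention; test-and-test-and-set and backoff variants address that, as covered in lecture.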
(fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
Further Reading:
- A Pragmatic Implementation of Non-Blocking Linked-Lists. by T. Harris, 2001
- Lock-Free Linked Lists and Skip Lists. by M. Fomitchev and E. Ruppert, 2004
- Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. by M. Michael, IEEE Trans on Parallel and Distributed Systems, 2004
- Lock-Free Data Structures with Hazard pointers. by A. Alexandrescu and M. Michael, Dr. Dobbs, 2004
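A Treiber-style lock-free stack, the canonical example behind the readings above, performs push and pop with compare-and-swap on the head pointer. This sketch omits safe memory reclamation: the `delete` in `pop` is only safe if no other thread can still hold the node, which is precisely the problem hazard pointers solve.

```cpp
#include <atomic>
#include <cassert>

// Treiber stack sketch. NOT safe for concurrent pop without a
// reclamation scheme (hazard pointers, epochs): freeing a node another
// thread may inspect enables use-after-free, and reuse enables ABA.
template <class T>
class LockFreeStack {
    struct Node { T v; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T v) {
        Node* n = new Node{v, head.load()};
        // On CAS failure, n->next is refreshed to the current head; retry.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(T& out) {
        Node* n = head.load();
        while (n) {
            // Swing head from n to n->next; on failure n is refreshed.
            if (head.compare_exchange_weak(n, n->next)) break;
        }
        if (!n) return false;
        out = n->v;
        delete n;  // UNSAFE under concurrency without hazard pointers
        return true;
    }
};
```

The ABA problem shows up when a popped-and-freed node's address is reused: a stale CAS can succeed against the "same" pointer value even though the stack has changed underneath it.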
(motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
(energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, what's in a modern SoC)
(motivation for DSLs, case studies on Liszt and Halide)
(GraphLab abstractions, GraphLab implementation, streaming graph processing, graph compression)
Further Reading:
- GraphLab Documentation
- Ligra: A Lightweight Graph Processing Framework for Shared Memory. by Shun and Blelloch, PPOPP 13
- GraphChi: Large-Scale Graph Computation on Just a PC. by Kyrola et al. OSDI 12
(producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
Further Reading:
- Apache Spark Web Site
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. by Zaharia et al. NSDI 2012
(how DRAM works, cache compression, DRAM compression, emerging memory technologies)
Further Reading:
- Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. by Pekhimenko et al. PACT 2012
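The core idea behind base+delta cache compression (as in the Pekhimenko et al. reading) is that the words of many cache lines cluster near a common base value, so a line can be stored as one base plus small per-word deltas. A simplified single-base sketch, assuming 8-byte words and 1-byte deltas (the real scheme tries several base/delta widths):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the compressed size in bytes of a "cache line" of 64-bit
// words under a single-base, 1-byte-delta encoding, or the uncompressed
// size if any delta does not fit. Simplified from the BDI scheme.
std::size_t bdi_size(const std::vector<int64_t>& line) {
    if (line.empty()) return 0;
    int64_t base = line[0];
    for (int64_t v : line) {
        int64_t d = v - base;
        if (d < -128 || d > 127)                 // delta needs > 1 byte
            return line.size() * sizeof(int64_t); // incompressible here
    }
    return sizeof(int64_t) + line.size();        // 8-byte base + 1B/word
}
```

For a typical 64-byte line of eight nearby pointers or counters this yields 16 bytes, a 4x ratio, which is why simple low-latency schemes like this are attractive in hardware.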
(supercomputing vs. distributed computing/analytics, design philosophy of both systems)
(intro to deep networks, what convolution does, mapping convolution to matrix multiplication, deep network compression)
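Mapping convolution to matrix multiplication is usually done with the "im2col" lowering: each output position's receptive field becomes a matrix row, so the convolution (cross-correlation, as deep learning frameworks define it) reduces to a matrix product with the filter. A 1-D sketch for brevity:

```cpp
#include <cassert>
#include <vector>

// im2col in 1-D: row i holds the k-wide window of x starting at i.
std::vector<std::vector<float>> im2col_1d(const std::vector<float>& x, int k) {
    int out = static_cast<int>(x.size()) - k + 1;   // "valid" output size
    std::vector<std::vector<float>> M(out, std::vector<float>(k));
    for (int i = 0; i < out; ++i)
        for (int j = 0; j < k; ++j)
            M[i][j] = x[i + j];
    return M;
}

// Convolution as a matrix-vector product: y = M * w.
std::vector<float> conv_as_matmul(const std::vector<float>& x,
                                  const std::vector<float>& w) {
    auto M = im2col_1d(x, static_cast<int>(w.size()));
    std::vector<float> y(M.size(), 0.0f);
    for (std::size_t i = 0; i < M.size(); ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
            y[i] += M[i][j] * w[j];                 // row i dot filter
    return y;
}
```

The lowering duplicates input data (overlapping windows appear in multiple rows) in exchange for turning the computation into dense GEMM, which highly tuned BLAS-style kernels execute at near-peak throughput.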
(basics of gradient descent and backpropagation, memory footprint issues, asynchronous parallel implementations of gradient descent)
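The gradient descent update at the heart of this lecture is just w ← w − lr · ∇f(w), applied with minibatch gradient estimates when training networks. A minimal single-parameter sketch on f(w) = (w − 3)², whose gradient is 2(w − 3):

```cpp
#include <cassert>
#include <cmath>

// Gradient descent on f(w) = (w - 3)^2. With learning rate lr, each
// step multiplies the error (w - 3) by (1 - 2*lr), so for lr < 1 the
// iterate converges geometrically to the minimizer w = 3.
double gradient_descent(double w0, double lr, int steps) {
    double w = w0;
    for (int i = 0; i < steps; ++i) {
        double grad = 2.0 * (w - 3.0);  // analytic gradient of f
        w -= lr * grad;                 // the update rule
    }
    return w;
}
```

Asynchronous parallel variants run this same update from many workers against shared parameters without locking, trading gradient staleness for throughput.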
(parallel rasterization, Z/color-buffer compression, tiled rendering, sort-everywhere parallel rendering)
(tips for giving a clear talk, a bit of philosophy)
(the students explore high-performance and high-efficiency topics of their choosing)