Parallel Computer Architecture and Programming (CMU 15-418/618)
This page contains lecture slides, videos, and recommended readings for the Spring 2016 offering of 15-418/618. The full listing of lecture videos is available on the course's Panopto site.
Further Reading:
- The Future of Microprocessors. by K. Olukotun and L. Hammond, ACM Queue 2005
- Power: A First-Class Architectural Design Constraint. by Trevor Mudge, IEEE Computer 2001
(forms of parallelism + understanding Latency and BW)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website)
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern throughput processor)
- Intel's Haswell CPU Microarchitecture. D. Kanter, 2013 (realworldtech.com article)
- NVIDIA GeForce GTX 980 Whitepaper. NVIDIA Technical Report 2014
(ways of thinking about parallel programs, and their corresponding hardware implementations)
(the thought process of parallelizing a program)
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity Course: Intro to Parallel Programming Using CUDA, by Luebke and Owens
- The Thrust Library is a useful collection library for CUDA.
- Rise of the Graphics Processor. D. Blythe, Proceedings of the IEEE, 2008 (a nice overview of GPU history)
- NVIDIA GeForce GTX 980 Whitepaper. NVIDIA Technical Report 2014
- The Compute Architecture of Intel Processor Graphics. Intel Technical Report, 2015 (a very nice description of a modern Intel integrated GPU)
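The CUDA abstractions discussed above map a grid of thread blocks onto the data, with each logical thread computing one element. As a rough mental model (plain C++, not actual CUDA; the function names here are illustrative stand-ins for a `__global__` kernel and its launch), a SAXPY launch can be pictured as a loop over the (block, thread) grid:

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for a CUDA kernel body: one logical "thread"
// computes one element. In real CUDA, blockIdx/threadIdx supply i.
void saxpy_element(int i, float a, const std::vector<float>& x,
                   std::vector<float>& y) {
    y[i] = a * x[i] + y[i];
}

// Stand-in for a kernel launch: iterate over the (block, thread) grid.
// A real GPU runs these iterations in parallel across its cores.
void launch_saxpy(float a, const std::vector<float>& x, std::vector<float>& y,
                  int threads_per_block = 256) {
    int n = static_cast<int>(x.size());
    int num_blocks = (n + threads_per_block - 1) / threads_per_block;  // ceil
    for (int b = 0; b < num_blocks; ++b)
        for (int t = 0; t < threads_per_block; ++t) {
            int i = b * threads_per_block + t;  // global thread index
            if (i < n)                          // bounds guard, as in a real kernel
                saxpy_element(i, a, x, y);
        }
}
```

Note how the grid may over-provision threads (the last block is partially full), which is why the bounds guard appears in essentially every CUDA kernel.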
(achieving good work balance while minimizing the overhead of making the assignment, scheduling Cilk programs with work stealing)
Further Reading:
- CilkPlus documentation
- Scheduling Multithreaded Computations by Work Stealing. by Blumofe and Leiserson, JACM 1999
- The Implementation of the Cilk-5 Multithreaded Language. by Frigo et al., PLDI 1998
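The spawn/sync style of Cilk programs can be sketched in plain C++ using `std::async` (an illustration only: `std::async` creates OS-level tasks rather than using a work-stealing scheduler as Cilk's runtime does):

```cpp
#include <cassert>
#include <cstddef>
#include <future>

// Divide-and-conquer sum in the Cilk spawn/sync style.
// cilk_spawn ~ std::async; cilk_sync ~ future::get().
long long parallel_sum(const long long* a, std::size_t n) {
    if (n <= 1024) {                       // serial base case
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    std::size_t half = n / 2;
    // "Spawn" the left half; the continuation computes the right half.
    auto left = std::async(std::launch::async, parallel_sum, a, half);
    long long right = parallel_sum(a + half, n - half);
    return left.get() + right;             // "sync": join the spawned child
}
```

Cilk's lazy work-stealing scheduler makes spawns nearly free when no thief steals them, which is what keeps the overhead of creating this much fine-grained parallelism acceptable.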
(message passing, async vs. blocking sends/receives, pipelining, techniques to increase arithmetic intensity, avoiding contention)
(examples of optimizing parallel programs)
(hard vs. soft scaling, memory-constrained scaling, scaling problem size, tips for analyzing code performance)
(definition of memory coherence, invalidation-based coherence using MSI and MESI, maintaining coherence with multi-level caches, false sharing)
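False sharing arises when logically independent data shared by no one still lands on the same cache line, so coherence traffic ping-pongs the line between cores. A common fix, sketched below (assuming 64-byte cache lines, which is typical but not universal), is to pad per-thread counters out to a full line:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Pad each per-thread counter so adjacent counters occupy distinct
// cache lines (assumes 64-byte lines; adjust for other machines).
struct PaddedCounter {
    std::atomic<long> value{0};
    char pad[64 - sizeof(std::atomic<long>)];
};

// Each thread increments only its own counter; with padding, no two
// threads' writes contend for the same cache line.
long count_with_padding(int nthreads, long iters) {
    std::vector<PaddedCounter> counters(nthreads);
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

The unpadded version computes the same totals, just slower: false sharing is a performance bug, not a correctness bug, which is why it is easy to miss.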
(scaling problem of snooping, implementation of directories, directory storage optimization)
(deadlock, livelock, starvation, implementation of coherence on an atomic and split-transaction bus)
(consistency vs. coherence, relaxed consistency models and their motivation, acquire/release semantics)
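Acquire/release semantics can be illustrated with the classic flag-based message-passing pattern in C++ atomics: the store-release "publishes" the data, and any load-acquire that observes the flag is guaranteed to also see the data written before the release.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                   // plain, non-atomic payload
std::atomic<bool> ready{false}; // synchronization flag

void producer() {
    data = 42;                                    // ordinary write
    ready.store(true, std::memory_order_release); // publish: no earlier
                                                  // write may move below this
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    return data;  // acquire pairs with the release: guaranteed to see 42
}
```

Under a fully relaxed model, the write to `data` could become visible after the flag; the acquire/release pair is exactly the ordering constraint needed here, and nothing more, which is the motivation for relaxed consistency models.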
(scale out, load balancing, elasticity, caching)
Further Reading:
- www.highscalability.com. A cool site with many case studies (see "All Time Favorites" section)
- James Hamilton's Blog
(network properties, topology, basics of flow control)
(machine-level atomic operations, implementing locks, implementing barriers)
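A test-and-set spinlock, one of the lock implementations built from a single atomic machine operation, can be sketched in C++ as follows:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Test-and-set spinlock: lock() spins until the atomic exchange
// observes the flag clear; unlock() clears it with release semantics.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

// Hypothetical driver: several threads increment a shared counter
// under the lock; the final count shows mutual exclusion held.
long run_counter(int nthreads, long iters) {
    SpinLock lk;
    long counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&lk, &counter, iters] {
            for (long i = 0; i < iters; ++i) {
                lk.lock();
                ++counter;
                lk.unlock();
            }
        });
    for (auto& th : ts) th.join();
    return counter;
}
```

This naive version generates heavy coherence traffic under contention; test-and-test-and-set and backoff variants address that, as covered in lecture.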
(fine-grained synchronization via locks, basics of lock-free programming: single-reader/writer queues, lock-free stacks, the ABA problem, hazard pointers)
Further Reading:
- A Pragmatic Implementation of Non-Blocking Linked-Lists. by T. Harris, 2001
- Lock-Free Linked Lists and Skip Lists. by M. Fomitchev and E. Ruppert, 2004
- Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. by M. Michael, IEEE Trans on Parallel and Distributed Systems, 2004
- Lock-Free Data Structures with Hazard pointers. by A. Alexandrescu and M. Michael, Dr. Dobbs, 2004
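A Treiber-style lock-free stack, the canonical example behind the readings above, performs push and pop with compare-and-swap on the head pointer. This sketch omits safe memory reclamation: the `delete` in `pop` is only safe if no other thread can still hold the node, which is precisely the problem hazard pointers solve.

```cpp
#include <atomic>
#include <cassert>

// Treiber stack sketch. NOT safe for concurrent pop without a
// reclamation scheme (hazard pointers, epochs): freeing a node another
// thread may inspect enables use-after-free, and reuse enables ABA.
template <class T>
class LockFreeStack {
    struct Node { T v; Node* next; };
    std::atomic<Node*> head{nullptr};
public:
    void push(T v) {
        Node* n = new Node{v, head.load()};
        // On CAS failure, n->next is refreshed to the current head; retry.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    bool pop(T& out) {
        Node* n = head.load();
        while (n) {
            // Swing head from n to n->next; on failure n is refreshed.
            if (head.compare_exchange_weak(n, n->next)) break;
        }
        if (!n) return false;
        out = n->v;
        delete n;  // UNSAFE under concurrency without hazard pointers
        return true;
    }
};
```

The ABA problem shows up when a popped-and-freed node's address is reused: a stale CAS can succeed against the "same" pointer value even though the stack has changed underneath it.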
(motivation for transactions, design space of transactional memory implementations, lazy-optimistic HTM)
(energy-efficient computing, motivation for heterogeneous processing, fixed-function processing, FPGAs, what's in a modern SoC)
(motivation for DSLs, case studies on Liszt and Halide)
(GraphLab abstractions, GraphLab implementation, streaming graph processing, graph compression)
Further Reading:
- GraphLab Documentation
- Ligra: A Lightweight Graph Processing Framework for Shared Memory. by Shun and Blelloch, PPOPP 13
- GraphChi: Large-Scale Graph Computation on Just a PC. by Kyrola et al. OSDI 12
(producer-consumer locality, RDD abstraction, Spark implementation and scheduling)
Further Reading:
- Apache Spark Web Site
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. by Zaharia et al. NSDI 2012
(how DRAM works, cache compression, DRAM compression, emerging memory technologies)
Further Reading:
- Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches. by Pekhimenko et al. PACT 2012
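The core idea behind base+delta cache compression (as in the Pekhimenko et al. reading) is that the words of many cache lines cluster near a common base value, so a line can be stored as one base plus small per-word deltas. A simplified single-base sketch, assuming 8-byte words and 1-byte deltas (the real scheme tries several base/delta widths):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the compressed size in bytes of a "cache line" of 64-bit
// words under a single-base, 1-byte-delta encoding, or the uncompressed
// size if any delta does not fit. Simplified from the BDI scheme.
std::size_t bdi_size(const std::vector<int64_t>& line) {
    if (line.empty()) return 0;
    int64_t base = line[0];
    for (int64_t v : line) {
        int64_t d = v - base;
        if (d < -128 || d > 127)                 // delta needs > 1 byte
            return line.size() * sizeof(int64_t); // incompressible here
    }
    return sizeof(int64_t) + line.size();        // 8-byte base + 1B/word
}
```

For a typical 64-byte line of eight nearby pointers or counters this yields 16 bytes, a 4x ratio, which is why simple low-latency schemes like this are attractive in hardware.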
(supercomputing vs. distributed computing/analytics, design philosophy of both systems)
(intro to deep networks, what convolution does, mapping convolution to matrix multiplication, deep network compression)
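Mapping convolution to matrix multiplication is usually done with the "im2col" lowering: each output position's receptive field becomes a matrix row, so the convolution (cross-correlation, as deep learning frameworks define it) reduces to a matrix product with the filter. A 1-D sketch for brevity:

```cpp
#include <cassert>
#include <vector>

// im2col in 1-D: row i holds the k-wide window of x starting at i.
std::vector<std::vector<float>> im2col_1d(const std::vector<float>& x, int k) {
    int out = static_cast<int>(x.size()) - k + 1;   // "valid" output size
    std::vector<std::vector<float>> M(out, std::vector<float>(k));
    for (int i = 0; i < out; ++i)
        for (int j = 0; j < k; ++j)
            M[i][j] = x[i + j];
    return M;
}

// Convolution as a matrix-vector product: y = M * w.
std::vector<float> conv_as_matmul(const std::vector<float>& x,
                                  const std::vector<float>& w) {
    auto M = im2col_1d(x, static_cast<int>(w.size()));
    std::vector<float> y(M.size(), 0.0f);
    for (std::size_t i = 0; i < M.size(); ++i)
        for (std::size_t j = 0; j < w.size(); ++j)
            y[i] += M[i][j] * w[j];                 // row i dot filter
    return y;
}
```

The lowering duplicates input data (overlapping windows appear in multiple rows) in exchange for turning the computation into dense GEMM, which highly tuned BLAS-style kernels execute at near-peak throughput.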
(basics of gradient descent and backpropagation, memory footprint issues, asynchronous parallel implementations of gradient descent)
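The gradient descent update at the heart of this lecture is just w ← w − lr · ∇f(w), applied with minibatch gradient estimates when training networks. A minimal single-parameter sketch on f(w) = (w − 3)², whose gradient is 2(w − 3):

```cpp
#include <cassert>
#include <cmath>

// Gradient descent on f(w) = (w - 3)^2. With learning rate lr, each
// step multiplies the error (w - 3) by (1 - 2*lr), so for lr < 1 the
// iterate converges geometrically to the minimizer w = 3.
double gradient_descent(double w0, double lr, int steps) {
    double w = w0;
    for (int i = 0; i < steps; ++i) {
        double grad = 2.0 * (w - 3.0);  // analytic gradient of f
        w -= lr * grad;                 // the update rule
    }
    return w;
}
```

Asynchronous parallel variants run this same update from many workers against shared parameters without locking, trading gradient staleness for throughput.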
(parallel rasterization, Z/color-buffer compression, tiled rendering, sort-everywhere parallel rendering)
(tips for giving a clear talk, a bit of philosophy)
(the students explore high-performance and high-efficiency topics of their choosing)