Parallel Computer Architecture and Programming (CMU 15-418/618)
This page contains lecture slides, videos, and recommended readings for the Spring 2015 offering of 15-418/618. The full listing of lecture videos is available on the Panopto site here.
Further Reading:
- The Future of Microprocessors. K. Olukotun and L. Hammond, ACM Queue 2005
(forms of parallelism, understanding latency and bandwidth)
Further Reading:
- CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website.)
(and their corresponding hardware implementations)
(the thought process of parallelizing a program)
(CUDA programming abstractions, and how they are implemented on modern GPUs)
Further Reading:
- You may enjoy the free Udacity course Intro to Parallel Programming Using CUDA by Luebke and Owens.
- The Thrust library is a useful collections library for CUDA.
- Rise of the Graphics Processor. D. Blythe (Proceedings of the IEEE 2008) is a nice overview of GPU history.
(the tension between achieving good work balance and minimizing the overhead of making the assignment)
(techniques for reducing communication and contention, inherent vs. artifactual communication)
(a few examples of parallelizing algorithms)
(evaluating program performance, how to scale performance analysis "up" and "down")
(the basics of cache coherence, the MSI and MESI protocols)
(evaluating the performance of snooping implementations, coherence in a multi-level cache hierarchy)
(why directories enable scalable cache coherence, reducing the overhead of directory storage)
(the motivation for and implications of relaxed consistency memory models)
(scale-out parallelism, elasticity, and significant amounts of caching)
Further Reading:
- Highscalability.com is a great website with case studies. (Check out their "real-life architectures" section.)
- James Hamilton's Blog
(the challenges of implementing invalidation-based coherence in a real system)
(challenges of fine-grained locking, basics of lock-free data structures)
(network-on-a-chip topologies and flow-control algorithms)
(motivation for transactional memory, the design space of implementations)
(the basics of Cilk's locality-aware, work-stealing scheduler)
Further Reading:
- The Implementation of the Cilk-5 Multithreaded Language. Frigo et al. PLDI 1998
- Scheduling Multithreaded Computations by Work Stealing. Blumofe et al. Journal of the ACM 1999
(area and energy-efficient computing via heterogeneous parallel processors)
Further Reading:
- Amdahl's Law in the Multi-Core Era, Hill and Marty, IEEE Computer 2008.
- Dark Silicon and the End of Multi-Core Scaling, Esmaeilzadeh et al. ISCA 2011.
- The Future of GPU Computing, Supercomputing 2009 Conference talk by Bill Dally (contains interesting slides on power consumption)
(motivation for domain-specific systems, two example systems: Liszt and Halide)
(examples from GraphLab, Ligra, and Green-Marl, discussion of what makes a good programming system)
Further Reading:
- Spark GraphX Project Page
- Ligra: A Lightweight Graph Processing Framework for Shared Memory. Shun et al. PPoPP 2013
- GraphChi: Large-Scale Graph Computation on Just a PC. A. Kyrola et al. OSDI 2012
(parallelization issues in modern databases, a lecture by Andy Pavlo)
(the RDD abstraction and how it enables efficient, distributed processing)
Further Reading:
- Spark Project Page
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. M. Zaharia et al. NSDI 2012
(how DRAM works and modern hardware approaches to improving locality and bandwidth)
(triangle rasterization as a sampling problem, parallel rasterization, HW z-buffer compression)
This lecture was a bonus lecture and was not recorded.
(Exam 2 review, how to give a good talk, course summary)