Lectures and Readings : Parallel Computer Architecture and Programming : 15-418/618 Spring 2015

Lecture 1: Why Parallelism

Watch the Lecture

Further Reading:

The Future of Microprocessors. K. Olukotun and L. Hammond, ACM Queue 2005

Lecture 2: A Modern Multi-Core Processor

(forms of parallelism, understanding latency and bandwidth)

Watch the Lecture

Further Reading:

CPU DB: Recording Microprocessor History. A. Danowitz, K. Kelley, J. Mao, J.P. Stevenson, M. Horowitz, ACM Queue 2012. (You can also take a peek at the CPU DB website.)

Lecture 3: Parallel Programming Models

(and their corresponding hardware implementations)

Watch the Lecture

Lecture 4: Parallel Programming Basics

(the thought process of parallelizing a program)

Watch the Lecture

Lecture 5: GPU Architecture and CUDA Programming

(CUDA programming abstractions, and how they are implemented on modern GPUs)

Watch the Lecture

Further Reading:

You may enjoy the free Coursera course Intro to Parallel Programming Using CUDA by Luebke and Owens.
The Thrust library is a useful library of parallel algorithms and data structures for CUDA.
Rise of the Graphics Processor. D. Blythe (Proceedings of IEEE 2008) is a nice overview of GPU history.

Lecture 6: Program Optimization I: Work Assignment

(the tension between achieving good work balance and minimizing the overhead of making the assignment)

Watch the Lecture

Lecture 7: Program Optimization II: Locality, Communication, and Contention

(techniques for reducing communication and contention, inherent vs. artifactual communication)

Watch the Lecture

Lecture 8: Parallel Application Case Studies

(a few examples of parallelizing algorithms)

Watch the Lecture

Lecture 9: Workload-Driven Performance Evaluation

(evaluating program performance, how to scale performance analysis "up" and "down")

Watch the Lecture

Lecture 10: Snooping-Based Cache Coherence I

(the basics of cache coherence, the MSI and MESI protocols)

Watch the Lecture

Lecture 11: Snooping-Based Cache Coherence II

(evaluating the performance of snooping implementations, coherence in a multi-level cache hierarchy)

Watch the Lecture

Lecture 12: Directory-Based Cache Coherence

(why directories enable scalable cache coherence, reducing the overhead of directory storage)

Watch the Lecture

Lecture 13: Relaxed Memory Consistency

(the motivation for and implications of relaxed consistency memory models)

Watch the Lecture

Lecture 14: Scaling a Web Site

(scale-out parallelism, elasticity, and significant amounts of caching)

Watch the Lecture

Further Reading:

Highscalability.com is a great web site with case studies. (Check out their "real-life architectures" section.)
James Hamilton's Blog

Lecture 15: A Basic Snooping-Based Multi-Processor Implementation

(the challenges of implementing invalidation-based coherence in a real system)

Watch the Lecture

Lecture 16: Implementing Basic Synchronization

(implementing locks and barriers)

Watch the Lecture

Lecture 17: Fine-Grained Synchronization and Lock-Free Programming

(challenges of fine-grained locking, basics of lock-free data structures)

Watch the Lecture

Lecture 18: Interconnection Networks

(network-on-a-chip topologies and flow-control algorithms)

Watch the Lecture

Lecture 19: Transactional Memory

(motivation for transactional memory, the design space of implementations)

Watch the Lecture

Lecture 20: Scheduling Fork-Join Parallelism

(the basics of Cilk's locality-aware, work-stealing scheduler)

Watch the Lecture

Further Reading:

The Implementation of the Cilk-5 Multithreaded Language. Frigo et al. PLDI 1998
Scheduling Multithreaded Computations by Work Stealing. Blumofe et al. Journal of the ACM 1999

Lecture 21: Heterogeneous Parallelism and Hardware Specialization

(area- and energy-efficient computing via heterogeneous parallel processors)

Watch the Lecture

Further Reading:

Amdahl's Law in the Multi-Core Era, Hill and Marty, IEEE Computer 2008.
Dark Silicon and the End of Multi-Core Scaling, Esmaeilzadeh et al. ISCA 2011.
The Future of GPU Computing, Supercomputing 2009 Conference talk by Bill Dally (contains interesting slides on power consumption)

Lecture 22: Domain-Specific Parallel Programming Systems

(motivation for domain-specific systems, two example systems: Liszt and Halide)

Watch the Lecture

Further Reading:

Lecture 23: Domain-Specific Programming Systems for Graph Processing

(examples from GraphLab, Ligra, and Green-Marl, discussion of what makes a good programming system)

Watch the Lecture

Further Reading: