Parallel Computer Architecture and Programming (CMU 15-418)
Further Reading:
The Future of Microprocessors by K. Olukotun and L. Hammond, ACM Queue 2005
(Forms of Parallelism + Understanding Latency and Bandwidth)
(and Their Corresponding Hardware/Software Implementations)
Further Reading:
Abstraction vs. Implementation by 15-418 students
ISPC: A SPMD Compiler for High-Performance CPU Programming by Matt Pharr and William R. Mark
Further Reading:
Decomposition, Assignment, and Orchestration of K-Means Clustering by 15-418 students
Further Reading:
How CUDA's Abstractions Map to a GPU Implementation by 15-418 students
Further Reading:
Workload-Driven Performance Evaluation by 15-418 students
Analyzing a Program Using Performance Tools by Alex Reece
Further Reading:
Optimizations in Directory-Based Coherence Schemes by 15-418 students
(+ A Mid-Semester Course Review)
Further Reading:
Design Considerations in Implementing Cache Coherence by 15-418 students
Further Reading:
Lock-Free Algorithms by 15-418 students
The Implementation of Lock-free Stacks and Linked Lists by 15-418 students
Further Reading:
Case Studies of Xbox 360, Tegra 4i, and iPhone 3GS by 15-418 students
Amdahl's Law in the Multicore Era by M. D. Hill and M. R. Marty
[Check out their Amdahl's Law calculator]
Further Reading:
Summary of Domain-Specific Programming on Graphs by 15-418 students
GraphLab: A New Parallel Framework for Machine Learning by Y. Low et al., UAI 2010 (www.graphlab.org)
Ligra: A Lightweight Graph Processing Framework for Shared Memory by J. Shun and G. Blelloch, PPoPP 13
Green-Marl: A DSL for Easy and Efficient Graph Analysis by S. Hong et al., ASPLOS 12