(from slide 41:) Modern computers are parallelized in a few ways:
via multiple processing cores, and
via multiple ALUs within a core.
The former enables multi-thread parallelism, while the latter suits "data-parallel workloads" (e.g. a loop where every iteration accesses the same index of several vectors).
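A loop like the one below (a minimal sketch; the function name and the OpenMP pragma are my own additions, not from the slides) exposes both kinds of parallelism: iterations can be split across cores, and each core's vector ALUs can process several indices per instruction.

```cpp
#include <cstddef>

// Data-parallel workload: each iteration touches only index i of x, y,
// and out, with no dependence on other iterations. Cores can each take a
// chunk of iterations, and the compiler can map each chunk onto SIMD lanes.
void saxpy(float a, const float *x, const float *y, float *out, std::size_t n) {
    #pragma omp parallel for simd  // requires compiling with -fopenmp
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a * x[i] + y[i];
    }
}
```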
Memory access is often the bottleneck for parallelized tasks. The example from slide 65 demonstrates how a large problem can be efficiently parallelized, yet performance is still limited by the bandwidth of the memory bus, even on a GPU. To overcome this, a well-parallelized program will (from slide 67:)
minimize the number of memory accesses, reusing data within and between threads, and
prioritize arithmetic over memory requests when possible, since ALU throughput is plentiful compared to memory bandwidth (see the sketch after this list).
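As an illustration (a sketch with made-up function names, not anything from the slides), the same computation written as two passes versus one fused pass shows both points: the fused version reuses each loaded value while it sits in a register and does more arithmetic per byte moved across the memory bus.

```cpp
#include <cstddef>

// Two passes: x is streamed through memory twice, and tmp costs a full
// extra write plus read -- the memory bus does most of the work.
void two_pass(const float *x, const float *y, float *tmp, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] * x[i];
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] + y[i];
}

// One fused pass: each element is loaded once and reused for both
// operations, roughly halving memory traffic for the same arithmetic.
void fused(const float *x, const float *y, float *out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float xi = x[i];          // load once, reuse
        out[i] = xi * xi + y[i];  // more arithmetic per byte moved
    }
}
```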
The four (I think) key concepts, summarized: