Slide View : Parallel Computer Architecture and Programming : 15-418/618 Spring 2016

Slide 49 of 57

cmusam

With separate foo and bar functions, tmp is stored to memory and then loaded back from it, which is unnecessary bandwidth usage. The compiler may detect this and optimize it by keeping tmp in a register instead of in a memory buffer.
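A minimal sketch of the difference, assuming foo and bar are simple element-wise operations (the bodies below are hypothetical placeholders, not the slide's definitions, and two_pass and fused are illustrative names):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the slide's foo and bar. */
static float foo(float x) { return x * 2.0f; }
static float bar(float x) { return x + 1.0f; }

/* Two passes: every tmp[i] is written to memory by the first loop
   and read back from memory by the second loop. */
void two_pass(const float *input, float *tmp, float *output, size_t n) {
    for (size_t i = 0; i < n; i++)
        tmp[i] = foo(input[i]);
    for (size_t i = 0; i < n; i++)
        output[i] = bar(tmp[i]);
}

/* Fused: one load of input[i] and one store of output[i] per element.
   The intermediate value is never written to an array, so the compiler
   is free to keep it in a register. */
void fused(const float *input, float *output, size_t n) {
    for (size_t i = 0; i < n; i++)
        output[i] = bar(foo(input[i]));
}
```

Both versions compute the same result. The two-pass version moves roughly twice as much data, because each element of tmp makes a round trip through memory.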

stride16

The code provided on the slide reads each input element from memory only once. It is written as:

output[i] = bar(foo(input[i]));

This code can be written in several different ways, which can lead to it being misinterpreted.
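For instance, here are two spellings that should behave the same, assuming foo and bar are plain element-wise functions (their bodies and the names fused_nested and fused_local are made up for illustration):

```c
#include <stddef.h>

/* Hypothetical element-wise helpers, for illustration only. */
static float foo(float x) { return x * x; }
static float bar(float x) { return x - 1.0f; }

/* Spelling 1: nested call, no named intermediate. */
void fused_nested(const float *input, float *output, size_t n) {
    for (size_t i = 0; i < n; i++)
        output[i] = bar(foo(input[i]));
}

/* Spelling 2: a named local scalar. This can look like an extra memory
   access, but t is a local variable, not an array element, so the
   compiler can hold it in a register. */
void fused_local(const float *input, float *output, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float t = foo(input[i]);
        output[i] = bar(t);
    }
}
```

In both spellings each input element is loaded once and each output element is stored once. The named temporary changes how the code reads, not how much memory traffic it needs.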