With separate foo and bar functions, tmp is stored to memory and then loaded back from it, which uses bandwidth unnecessarily. The compiler may detect this and optimize it by keeping tmp in temporary storage, e.g. a register, instead of writing it out to memory.
The code provided on the slide reads each input from memory only once, and is written as:
output[i] = bar(foo(input[i]));
This code can be written in several different ways, which can lead to it being misinterpreted.
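To make the comparison concrete, here is a minimal sketch of the two forms side by side. The bodies of foo and bar, the function names unfused and fused, and the explicit tmp array are assumptions for illustration only; the slide does not define them.

```c
#include <stddef.h>

/* Hypothetical element-wise operations; the slide does not define them. */
static float foo(float x) { return x * 2.0f; }
static float bar(float x) { return x + 1.0f; }

/* Unfused: two passes over the data. Every tmp[i] is written to memory
   in the first loop and read back from memory in the second. */
void unfused(size_t n, const float *input, float *tmp, float *output) {
    for (size_t i = 0; i < n; i++)
        tmp[i] = foo(input[i]);
    for (size_t i = 0; i < n; i++)
        output[i] = bar(tmp[i]);
}

/* Fused: one pass. The intermediate result of foo never needs to be
   written to an array, so it can stay in a register. */
void fused(size_t n, const float *input, float *output) {
    for (size_t i = 0; i < n; i++)
        output[i] = bar(foo(input[i]));
}
```

Both versions produce the same output; the difference is that the fused loop performs one load and one store per element, while the unfused version performs two of each.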