It would be interesting to compare the performance of Halide blur program against optimized blur program in the previous slide.
One of the other things I liked about Halide(saw it in spring 2016 lecture could not find it here) was the scheduling primitives using which we can change our code to try out solutions with minimal(compute_at(ROOT)) and maximal(inline()) locality by changing just one line of code.
This is similar to how atomic code is declarative(tells what to do) and locks are imperative(tells how to do it), Halide is declarative, while all of the C++ code on the previous page would be imperative. Because the programmer only has to specify what operations need to be done, they save time by not having to worry about how exactly to parallelize it.
I guess Halide will a least perform as well as the optimized blur program because it is optimized for doing this specific task and may combine more optimizations aggressively.