The Wikipedia page on automatic differentiation describes a different way of computing the derivative using dual numbers. Can this be used instead of backpropagating, and how does this compare memory-wise?
Just to confirm my understanding: we aren't interested in the partial derivative of f with respect to the input, right? The partials with respect to the weights are what we are after.
@xingdaz Yes, but to get those we need the partial of f with respect to the input of each layer.
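To make that concrete, here is a minimal sketch using a made-up two-layer scalar network (h = w1 * x, f = w2 * h). The network and the function names are assumptions chosen for illustration, not something from the lecture. The point is that the gradient for the first layer's weight cannot be formed without the partial of f with respect to the second layer's input.

```python
def forward(x, w1, w2):
    """Hypothetical two-layer scalar network: h = w1 * x, f = w2 * h."""
    h = w1 * x  # output of layer 1, which is the input of layer 2
    f = w2 * h  # output of layer 2
    return h, f


def backward(x, w1, w2):
    """Return (df/dw1, df/dw2) using the chain rule."""
    h, _ = forward(x, w1, w2)
    df_dw2 = h          # partial of f with respect to layer 2's weight
    df_dh = w2          # partial of f with respect to layer 2's input
    df_dw1 = df_dh * x  # chain rule: df/dw1 = df/dh * dh/dw1
    return df_dw1, df_dw2


print(backward(3.0, 2.0, 5.0))
```

Here df_dh is never reported as a final result, but df_dw1 depends on it, which is why backpropagation computes the partial with respect to each layer's input along the way.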
@dhua: Automatic differentiation with dual numbers is just an algebraic trick for carrying the derivative alongside the value it is the derivative of. It does not actually change anything fundamental about the computation.
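As a rough illustration of the dual-number idea, the sketch below defines a hypothetical `Dual` class (value plus derivative part) and differentiates a made-up function f = w2 * (w1 * x). None of these names come from the Wikipedia page or the lecture; they are assumptions for the example. Dual numbers give forward-mode differentiation: no intermediate activations need to be kept for a backward pass, but each pass yields the derivative with respect to only one seeded variable, so a network with many weights needs many passes. Backpropagation is reverse mode: it stores the activations from the forward pass and gets all weight partials in one backward pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Hypothetical dual number val + eps * e, where e * e = 0."""
    val: float
    eps: float = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule falls out of (a + a'e)(b + b'e) with e*e = 0.
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

    __rmul__ = __mul__


def f(x, w1, w2):
    """Illustrative function of two weights."""
    return w2 * (w1 * x)


def forward_mode_grad(x, w1, w2):
    """One forward pass per weight: seed eps = 1 on the weight of interest."""
    d_w1 = f(x, Dual(w1, 1.0), Dual(w2, 0.0)).eps
    d_w2 = f(x, Dual(w1, 0.0), Dual(w2, 1.0)).eps
    return d_w1, d_w2


print(forward_mode_grad(3.0, 2.0, 5.0))
```

Note that `forward_mode_grad` evaluates f twice for two weights. That per-weight cost is the usual reason reverse mode is preferred for training networks, at the price of storing the forward-pass activations.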
Here's an article on how they relate: https://idontgetoutmuch.wordpress.com/2013/10/13/backpropogation-is-just-steepest-descent-with-automatic-differentiation-2/