As some hopefully useful intuition for why we apply the chain rule, consider that when we feed the output of one layer (Layer 1) into the input of another layer (Layer 2), the rate of change of Layer 2's input is exactly the rate of change of Layer 1's output. Thus, as the rate of change of Layer 1's output changes, the rate of change of Layer 2's input changes with it, which in turn affects the rate of change of Layer 2's output (hence, the chain rule).

In general, the chain rule says that for $F(x) = f(g(x))$, $F'(x) = f'(g(x))\cdot g'(x)$. This is essentially saying "we have to multiply the rate of change of $f$ by the rate of change of $g$ with respect to its input, because as $g$'s rate changes, $f$'s rate changes proportionally". Exactly the same logic applies when we layer the inputs and outputs of functions in a neural network.
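To make this concrete, here is a minimal sketch (the functions `f` and `g` below are made-up examples, not taken from the text): we compose two simple functions, compute $F'(x)$ via the chain rule, and check it against a finite-difference approximation.

```python
import math

def g(x):
    return x ** 2            # inner function (think "Layer 1")

def g_prime(x):
    return 2 * x

def f(u):
    return math.sin(u)       # outer function (think "Layer 2")

def f_prime(u):
    return math.cos(u)

def F(x):
    return f(g(x))           # F(x) = f(g(x))

def F_prime(x):
    # Chain rule: F'(x) = f'(g(x)) * g'(x)
    return f_prime(g(x)) * g_prime(x)

# Sanity check against a central finite-difference estimate of F'(x)
x, h = 1.3, 1e-6
numeric = (F(x + h) - F(x - h)) / (2 * h)
print(F_prime(x), numeric)   # the two values should agree closely
```

Note how `g_prime(x)` scales the result: if the inner function's output changes twice as fast, the composite's rate of change doubles too, which is exactly the intuition above.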
