An efficient way of applying the chain rule of partial differentiation to calculate the gradient of the loss function with respect to the parameters of the network.

The error of neuron j in layer l is:

$$
\delta^{(l)}_j = \frac{\partial J}{\partial z^{(l)}_j} = \frac{\partial J}{\partial a^{(l)}_j}\, \phi'\!\left(z^{(l)}_j\right)
$$
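For example, assuming a squared-error loss $J = \tfrac{1}{2}\sum_j \left(a^{(L)}_j - y_j\right)^2$ and a sigmoid activation $\phi = \sigma$ at the output layer $L$ (both choices are purely illustrative), this gives

$$
\delta^{(L)}_j = \left(a^{(L)}_j - y_j\right)\, \sigma\!\left(z^{(L)}_j\right)\left(1 - \sigma\!\left(z^{(L)}_j\right)\right)
$$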

In an MLP, the loss depends on the neurons of layer $l$ only through the pre-activations of the neurons in the subsequent layer $l+1$, so we can use the chain rule to write

$$
\begin{aligned}
\delta^{(l)}_j = \frac{\partial J}{\partial z^{(l)}_j} &= \sum_k \frac{\partial J}{\partial z^{(l+1)}_k} \frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} \\
&= \sum_k \delta^{(l+1)}_k \frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} \\
&= \left( \sum_k W^{(l+1)}_{jk}\, \delta^{(l+1)}_k \right) \phi'\!\left(z^{(l)}_j\right)
\end{aligned}
$$
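The last equality follows because, with the indexing used here, $z^{(l+1)}_k = \sum_j W^{(l+1)}_{jk}\, a^{(l)}_j + b^{(l+1)}_k$ and $a^{(l)}_j = \phi\!\left(z^{(l)}_j\right)$, so

$$
\frac{\partial z^{(l+1)}_k}{\partial z^{(l)}_j} = W^{(l+1)}_{jk}\, \phi'\!\left(z^{(l)}_j\right).
$$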

In vector form,

$$
\begin{aligned}
\delta^{(l)} &= \frac{\partial J}{\partial a^{(l)}} \odot \phi'\!\left(z^{(l)}\right), \\
\delta^{(l)} &= \frac{\partial J}{\partial b^{(l)}}, \\
\delta^{(l)} &= \left( W^{(l+1)}\, \delta^{(l+1)} \right) \odot \phi'\!\left(z^{(l)}\right).
\end{aligned}
$$
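As a concrete illustration, here is a minimal NumPy sketch of one step of the error recursion, assuming a sigmoid activation and the $W^{(l+1)}_{jk}$ indexing used above (so $W^{(l+1)}$ has shape $(n_l, n_{l+1})$); the function and variable names are illustrative, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward_step(W_next, delta_next, z_l):
    """One step of the error recursion:
    delta^{(l)} = (W^{(l+1)} delta^{(l+1)}) ⊙ phi'(z^{(l)}).

    W_next follows the W_{jk} indexing above, so it has shape (n_l, n_{l+1});
    delta_next has shape (n_{l+1},) and z_l has shape (n_l,).
    """
    return (W_next @ delta_next) * sigmoid_prime(z_l)
```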

The last piece is the derivative of the loss with respect to the weights:

$$ \frac{\partial J}{\partial W^{(l)}_{kj}} = \frac{\partial J}{\partial z^{(l)}_j}\, \frac{\partial z^{(l)}_j}{\partial W^{(l)}_{kj}} = \delta^{(l)}_j\, a^{(l-1)}_k $$
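Continuing the same illustrative sketch, the per-example parameter gradients for layer $l$ follow directly from the equations above (`a_prev` stands for $a^{(l-1)}$ and `delta_l` for $\delta^{(l)}$; the names are assumptions of this sketch):

```python
import numpy as np

def layer_gradients(delta_l, a_prev):
    """Per-example gradients for layer l:
    dJ/dW^{(l)}_{kj} = a^{(l-1)}_k * delta^{(l)}_j  ->  outer(a_prev, delta_l)
    dJ/db^{(l)}_j    = delta^{(l)}_j
    """
    grad_W = np.outer(a_prev, delta_l)  # shape (n_{l-1}, n_l), matching W^{(l)}_{kj}
    grad_b = delta_l.copy()
    return grad_W, grad_b
```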

In practice, gradients are accumulated over mini-batches of data and the parameters are updated once per mini-batch (a minimal sketch of such an update follows below). A separate question is whether the mammalian cortex can do back-propagation; see the Stanford Seminar "Can the brain do back-propagation?" by Geoffrey Hinton on YouTube for more insights on this.
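As a rough sketch of the mini-batch update mentioned above, assuming a hypothetical `backprop(x, y, params)` that returns per-example `(grad_W, grad_b)` pairs in the same layout as `params` (all names here are assumptions of this sketch):

```python
import numpy as np

def sgd_minibatch_update(params, minibatch, backprop, lr=0.1):
    """Accumulate per-example gradients over a mini-batch, then take a
    single gradient-descent step on every (W, b) pair in `params`."""
    grads = [(np.zeros_like(W), np.zeros_like(b)) for W, b in params]
    for x, y in minibatch:
        example_grads = backprop(x, y, params)  # hypothetical per-example backprop
        grads = [(gW + dW, gb + db)
                 for (gW, gb), (dW, db) in zip(grads, example_grads)]
    n = len(minibatch)
    return [(W - lr * gW / n, b - lr * gb / n)
            for (W, b), (gW, gb) in zip(params, grads)]
```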