Bartosz Witkowski - Blog.

# Preliminaries

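Backpropagation is one of the most popular methods of training multilayered artificial neural networks (ANNs). ANNs classify data and can be thought of as function approximators. Given enough hidden neurons and a suitable nonlinear activation function, a multilayer ANN can approximate any continuous function on a closed, bounded input domain to arbitrary accuracy.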

A two-layered fully-connected ANN can look like this:

Where $f_{A,B}$ are individual neurons. A neuron can have multiple weighted inputs, and connects to other neurons in the next layer. Additionally, we add a bias: a value that doesn’t depend on the neuron’s input.

Here you can see a schematic representation of a single neuron:

The $b^{(n)}$ here is our bias value, and $g(v)$ is a function called the activation function. Thus, the output of a single neuron is $f_{A,B}(v) = g \left( \left( \displaystyle \sum \limits_{i = 0}^N x_i \cdot w^{(A,B)}_i \right) + b^{(A,B)} \right)$.

The simplest activation function is the identity function. When using a threshold activation function a single neuron divides the space into two parts – a single neuron with a threshold activation is a primitive classifier.

A single neuron can only separate the space using a single plane/line – the two data classes must be linearly separable.
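A single threshold neuron acting as a linear classifier can be sketched in a few lines of Python. The names `threshold` and `neuron`, and the weights chosen for the logical AND example, are illustrative assumptions rather than anything defined in this post.

```python
def threshold(v):
    """Step activation: 1 when the weighted input is non-negative, else 0."""
    return 1 if v >= 0 else 0


def neuron(xs, ws, bias, g=threshold):
    """Output of one neuron: g(sum_i x_i * w_i + bias)."""
    return g(sum(x * w for x, w in zip(xs, ws)) + bias)


# Hypothetical weights: the line x1 + x2 = 1.5 separates (1, 1) from the rest,
# so the neuron computes logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
outputs = [neuron([x1, x2], and_weights, and_bias)
           for x1 in (0, 1) for x2 in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```

No choice of weights and bias makes the same neuron compute XOR, because the two XOR classes cannot be separated by a single line.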

Using two layers and a nonlinear activation function is what makes a multilayered neural network work – it lets us map a vector from $\mathbb{R}^N$ and contract/expand it into $\mathbb{R}^{N'}$, where it is now (hopefully) linearly separable.

The output of our hypothetical network is calculated as:

Where $f_{1, i}(v)$ is the output of the $i$-th first layer neuron:

$w_{A - B}$ is the weight between nodes $A$ and $B$, and $b_{i,j}$ is the bias of the $j$-th neuron in the $i$-th layer.
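The forward computation described above can be sketched as follows. This is a minimal illustration, not the post's own code: the sigmoid activation, the function names and the `weights[j][k]` layout (weight from input `k` to neuron `j`) are all assumptions.

```python
import math


def sigmoid(v):
    """One possible nonlinear activation function g(v)."""
    return 1.0 / (1.0 + math.exp(-v))


def layer(inputs, weights, biases, g=sigmoid):
    """Outputs of one fully-connected layer.

    weights[j][k] is the weight from input k to neuron j of this layer,
    biases[j] is the bias of neuron j.
    """
    return [g(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Two-layer network: the hidden layer's outputs feed the output layer."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)


# 2 inputs, 2 hidden neurons, 1 output; with all weights and biases at zero
# every neuron outputs sigmoid(0) = 0.5.
y = forward([0.5, -1.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [[0.0, 0.0]], [0.0])
print(y)  # [0.5]
```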

## Training

Let’s suppose you want to teach this network; it has $N_0$ inputs and $N_3$ outputs. To teach the network, you must provide a teacher set $t$ with $M$ samples. The teacher set consists of known correct input/output values.

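Training the network should minimize the difference between the correct outputs in the teacher set and the output of the network. We will minimize half the sum of squared errors; the root mean square (RMS) error is a monotonic function of this quantity, so both are minimized by the same weights.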

The only thing that determines the output is the weights between the nodes of the network, so the error will be a function of the weights:

$$E(w) = \frac{1}{2} \sum \limits_{i = 1}^{M} \sum \limits_{n = 1}^{N_3} \left( y_n^{(i)} - t_n^{(i)} \right)^2$$

Where $y_n^{(i)}$ is the output of the $n$-th output neuron for the $i$-th sample, and $t_n^{(i)}$ is the corresponding correct output from the teacher set. Using this definition of error, we will derive the backpropagation formula for batch learning (on-line learning would be without the sum over every sample). The $\frac{1}{2}$ term will come in handy during the derivation.

To simplify bookkeeping a little, we will switch to a simplified network with only one output. Because the weights between the hidden layer and the output layer don’t depend on each other, we can do that without any loss of generality. The weights between the input layer and the hidden layer will be labeled $w_{j,k}$ (the weight from the $k$-th input to the $j$-th hidden neuron), and the weights between the hidden and output layer are simply $\hat{w}_j$. Because we have one output, the error can now be simplified as:

$$E(w) = \frac{1}{2} \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right)^2$$

To minimize the error, we will update the weights of the network so that the expected error in the next iteration will be lower (gradient descent):

$$w^{(l + 1)} = w^{(l)} - h \cdot \nabla E(w^{(l)})$$

Where $l$ is the iteration “counter”, and $h$ is the learning rate.
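To get a feel for the update rule before applying it to a network, here is a small sketch of gradient descent on a one-dimensional toy error. The error $E(w) = \frac{1}{2}(w - 3)^2$ and the helper name `gradient_descent` are made up for illustration.

```python
def gradient_descent(grad, w, h, iterations):
    """Repeat w <- w - h * grad(w) and return every intermediate w."""
    history = [w]
    for _ in range(iterations):
        w = w - h * grad(w)
        history.append(w)
    return history


# Toy error E(w) = 0.5 * (w - 3)^2, so grad E(w) = w - 3 and the minimum is at w = 3.
path = gradient_descent(lambda w: w - 3.0, w=0.0, h=0.5, iterations=20)
print(path[:4])  # [0.0, 1.5, 2.25, 2.625]
print(path[-1])  # very close to 3
```

With $h = 0.5$ each step halves the distance to the minimum. A learning rate that is too large for a given error surface can overshoot the minimum instead of approaching it.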

# Deriving the formula

## Hidden layer – output layer

The weight update formula for $\hat{w}_1$ is: $\hat{w}_1^{(l + 1)} = \hat{w}_1^{(l)} - h \cdot \nabla E(\hat{w}_1^{(l)})$

The error:

So let’s proceed to differentiate the error function.

Let’s skip writing out the iteration index $l$ to make the notation a bit clearer.

Using the chain rule:

The $\frac{1}{2}$ and $2$ terms cancel each other out leaving:

The known outputs $t^{(i)}$ don’t depend on the weights, and are essentially constants. W.r.t. $\hat{w}_1$ their derivative is zero, leaving:

Where $\hat{r}^{(i)}$ is the input to the output layer neuron (what goes into the activation function) for the $i$-th training input. $r_j^{(i)}$ will be the input of the $j$-th neuron in the hidden layer:

$$\hat{r}^{(i)} = \left( \sum \limits_{j = 1}^{N_1} g(r_j^{(i)}) \cdot \hat{w}_j \right) + \hat{b}$$

Where $\hat{b}$ is the bias of the output neuron. If we use the chain rule once again we will have:

Clearly, the only term in the expression $\displaystyle \left( \sum \limits_{j = 1}^{N_1} g(r_j^{(i)}) \cdot \hat{w}_j \right) + \hat{b}$ that depends on $\hat{w}_1$ is $g(r^{(i)}_1) \cdot \hat{w}_1$. The rest of the sum vanishes after differentiation. Therefore, we arrive at:

And the full formula for the weight update:

Derivations of $\displaystyle \hat{w}_2 \ldots \hat{w}_{N_1}$ are similar and I’ll leave them out.
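The steps of this section can be collected into one chain. This is a sketch under assumed notation: $y^{(i)} = g(\hat{r}^{(i)})$ is the network output for the $i$-th sample, $t^{(i)}$ is the matching teacher value, and $g'$ is the derivative of the activation function.

$$
\begin{aligned}
\frac{\partial E}{\partial \hat{w}_1}
&= \frac{\partial}{\partial \hat{w}_1} \frac{1}{2} \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right)^2 \\
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot \frac{\partial}{\partial \hat{w}_1} g(\hat{r}^{(i)}) \\
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \frac{\partial \hat{r}^{(i)}}{\partial \hat{w}_1} \\
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot g(r_1^{(i)})
\end{aligned}
$$

Plugging this into the gradient descent step gives the update for the output layer weight:

$$\hat{w}_1^{(l + 1)} = \hat{w}_1^{(l)} - h \cdot \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot g(r_1^{(i)})$$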

## Input layer – hidden layer

Most of this derivation will be similar to the output layer case above; I’ll start commenting when we get to the differences. I’ll only do the derivation for $w_{1,1}$; the others can be done in a similar fashion.

Let’s drop the unwieldy indices as before and continue:

Here it gets interesting: we must find the derivative of $\displaystyle g(\hat{r}^{(i)})$ w.r.t. $w_{1, 1}$.

Using the chain rule:

The only part of the sum $\displaystyle \sum \limits_{j = 1}^{N_1} g(r_j^{(i)}) \cdot \hat{w}_j$ that depends on $w_{1, 1}$ is $\displaystyle g(r_1^{(i)}) \cdot \hat{w}_1$, continuing:

Using the chain rule:

In the sum: $\displaystyle \sum \limits_{j = 1}^{N_0} x_{j}^{(i)} \cdot w_{1, j}$ the only term with a non-zero derivative w.r.t $w_{1, 1}$ is $x_1^{(i)} \cdot w_{1, 1}$

Finally, the weight update rule for the hidden layer:
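Collected in one place, the hidden layer steps read as follows. As before, this is a sketch under assumed notation, with $y^{(i)} = g(\hat{r}^{(i)})$ the network output, $t^{(i)}$ the teacher value, and $r_1^{(i)}$ the weighted sum of the inputs $x_j^{(i)} \cdot w_{1, j}$ plus the bias of the first hidden neuron.

$$
\begin{aligned}
\frac{\partial E}{\partial w_{1,1}}
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \frac{\partial \hat{r}^{(i)}}{\partial w_{1,1}} \\
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \hat{w}_1 \cdot \frac{\partial}{\partial w_{1,1}} g(r_1^{(i)}) \\
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \hat{w}_1 \cdot g'(r_1^{(i)}) \cdot \frac{\partial r_1^{(i)}}{\partial w_{1,1}} \\
&= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \hat{w}_1 \cdot g'(r_1^{(i)}) \cdot x_1^{(i)}
\end{aligned}
$$

$$w_{1,1}^{(l + 1)} = w_{1,1}^{(l)} - h \cdot \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \hat{w}_1 \cdot g'(r_1^{(i)}) \cdot x_1^{(i)}$$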

## Biases

Deriving the update rules for the biases of neurons is mostly the same as for regular weights; the only difference is in the last couple of steps:

Similarly, for the hidden layer:
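A sketch of both bias gradients, assuming the same notation as in the weight derivations and writing $b_1$ (a label introduced here for illustration) for the bias of the first hidden neuron. The only change from the weight case is the innermost derivative: $\hat{r}^{(i)}$ depends on $\hat{b}$ with derivative $1$, and $r_1^{(i)}$ depends on $b_1$ with derivative $1$, so the trailing input factor disappears.

$$
\begin{aligned}
\frac{\partial E}{\partial \hat{b}} &= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \\
\frac{\partial E}{\partial b_1} &= \sum \limits_{i = 1}^{M} \left( y^{(i)} - t^{(i)} \right) \cdot g'(\hat{r}^{(i)}) \cdot \hat{w}_1 \cdot g'(r_1^{(i)})
\end{aligned}
$$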

# Putting it all together

To sum up all we’ve done so far, the weight update rules for the output layer:

And for the hidden layer:

And finally, biases:
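The update rules for the single-output network can be tried out with a short Python sketch. Everything concrete here is an assumption made for illustration: the sigmoid activation, the logical OR teacher set, the learning rate, the seed, and all function and variable names.

```python
import math
import random


def g(v):
    """Sigmoid activation (an assumed choice; any differentiable g works)."""
    return 1.0 / (1.0 + math.exp(-v))


def g_prime(v):
    """Derivative of the sigmoid."""
    s = g(v)
    return s * (1.0 - s)


def forward(x, w, b, w_hat, b_hat):
    """Return hidden inputs r_j, the output neuron input r_hat and the output y."""
    r = [sum(xk * wjk for xk, wjk in zip(x, wj)) + bj for wj, bj in zip(w, b)]
    r_hat = sum(g(rj) * whj for rj, whj in zip(r, w_hat)) + b_hat
    return r, r_hat, g(r_hat)


def error(samples, w, b, w_hat, b_hat):
    """E = 1/2 * sum over samples of (y - t)^2."""
    return 0.5 * sum((forward(x, w, b, w_hat, b_hat)[2] - t) ** 2 for x, t in samples)


def batch_step(samples, w, b, w_hat, b_hat, h):
    """One batch gradient descent step; returns the updated parameters."""
    n_hidden, n_in = len(w), len(w[0])
    dw = [[0.0] * n_in for _ in range(n_hidden)]
    db = [0.0] * n_hidden
    dw_hat = [0.0] * n_hidden
    db_hat = 0.0
    for x, t in samples:
        r, r_hat, y = forward(x, w, b, w_hat, b_hat)
        out = (y - t) * g_prime(r_hat)            # shared by every gradient
        db_hat += out                             # output bias
        for j in range(n_hidden):
            dw_hat[j] += out * g(r[j])            # hidden -> output weight
            hid = out * w_hat[j] * g_prime(r[j])
            db[j] += hid                          # hidden bias
            for k in range(n_in):
                dw[j][k] += hid * x[k]            # input -> hidden weight
    w = [[w[j][k] - h * dw[j][k] for k in range(n_in)] for j in range(n_hidden)]
    b = [b[j] - h * db[j] for j in range(n_hidden)]
    w_hat = [w_hat[j] - h * dw_hat[j] for j in range(n_hidden)]
    return w, b, w_hat, b_hat - h * db_hat


# Teacher set: logical OR as (input, correct output) pairs.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
rng = random.Random(0)
w = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
b = [0.0, 0.0]
w_hat = [rng.uniform(-1, 1) for _ in range(2)]
b_hat = 0.0

before = error(samples, w, b, w_hat, b_hat)
for _ in range(2000):
    w, b, w_hat, b_hat = batch_step(samples, w, b, w_hat, b_hat, h=0.5)
after = error(samples, w, b, w_hat, b_hat)
print(before, after)  # the error after training should be lower
```

The gradients are accumulated over the whole teacher set before the weights change, which corresponds to batch learning. Updating inside the sample loop instead would give the on-line variant.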

# Why backpropagation?

Why the weight update algorithm is called the backpropagation algorithm may not be apparent when considering the usual case of a network with two layers. To make it more obvious, we will consider a simple network with three layers:

The weight update for the output layer is similar to before:

For the second hidden layer:

## Weight update for the first hidden layer

Things start to get interesting in the first hidden layer. Let’s analyze the weight update for $w_{x_1,1}$:

I’ll do the relevant partials separately:

The derivative of the bias is, of course, zero:

The first term:

The derivatives of $b_3$ and of the output of the $2$-nd neuron are 0 w.r.t. $w_{x_1,1}$, so:

The second part is:

Putting it together we get:

## Error signal

The $(y^{(i)} - t^{(i)})$ term is sometimes written as $\delta^{(i)}$ and is called the error signal. If we rewrite the weight update rules for the output layer to use it we will get:

For the second hidden layer:

Writing in the weight update rule for $w_{x_1,1}$:

### Propagating the error signal

We can notice that some elements from the previous layers showed up:

To unify this, we have to think in terms of the error signal. One way to think about it is that the error from the previous layer ripples down, but it has to be scaled in proportion to how much the previous layer influenced it.

The weight gradient is then simply the product of three factors: the error introduced to the output, the gradient of the activation function at the current neuron’s input, and this neuron’s input. Recolored:

• error introduced.
• input.

So for the second hidden layer:

Does it hold for the first hidden layer?

I’ll try to tidy up the expression in parentheses:

Substituting it back:

Which is the same term we came up with previously, when we derived it explicitly.
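The same bookkeeping can be written as a loop over layers, which is also a convenient way to check the algebra numerically. The sketch below makes several assumptions of its own: a $\tanh$ activation, a `layers` list of `(weights, biases)` pairs with arbitrary values, and the function names. It propagates the error signal from the output back to the first layer, then compares one gradient against a finite-difference estimate.

```python
import math


def g(v):
    return math.tanh(v)


def g_prime(v):
    return 1.0 - math.tanh(v) ** 2


def forward(x, layers):
    """Return the neuron inputs and the activations of every layer."""
    activations, inputs = [x], []
    for weights, biases in layers:
        r = [sum(a * w for a, w in zip(activations[-1], row)) + b
             for row, b in zip(weights, biases)]
        inputs.append(r)
        activations.append([g(v) for v in r])
    return inputs, activations


def backward(x, t, layers):
    """Weight gradients of E = 1/2 * sum((y - t)^2) for a single sample."""
    inputs, activations = forward(x, layers)
    # Error signal at the output: delta = y - t.
    signal = [y - tn for y, tn in zip(activations[-1], t)]
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        weights, _ = layers[l]
        # Scale the incoming signal by the activation gradient at this layer.
        local = [s * g_prime(r) for s, r in zip(signal, inputs[l])]
        # gradient = error introduced * g'(neuron input) * input on that weight
        grads[l] = [[d * a for a in activations[l]] for d in local]
        # Let the signal ripple down to the previous layer through the weights.
        signal = [sum(local[j] * weights[j][k] for j in range(len(weights)))
                  for k in range(len(activations[l]))]
    return grads


def error(x, t, layers):
    y = forward(x, layers)[1][-1]
    return 0.5 * sum((yn - tn) ** 2 for yn, tn in zip(y, t))


# Three layers: 2 inputs -> 2 neurons -> 2 neurons -> 1 output (arbitrary values).
layers = [
    ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]),
    ([[0.5, -0.6], [-0.3, 0.2]], [0.2, -0.1]),
    ([[0.7, -0.8]], [0.05]),
]
x, t = [1.0, 0.5], [0.3]
grads = backward(x, t, layers)

# Finite-difference check on the first weight of the first hidden layer.
eps = 1e-6
w0 = layers[0][0][0][0]
layers[0][0][0][0] = w0 + eps
e_plus = error(x, t, layers)
layers[0][0][0][0] = w0 - eps
e_minus = error(x, t, layers)
layers[0][0][0][0] = w0
numeric = (e_plus - e_minus) / (2 * eps)
print(grads[0][0][0], numeric)  # the two values should agree closely
```

Each layer only needs the signal handed down from the layer after it, so nothing has to be re-derived per layer.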

# Summary

To sum up:

• The gist of the weight update algorithm is error backpropagation.
• The error signal “travels” from the output layer to the input layer.
• Each weight influences the error to some degree, and that influence must be taken into account when propagating the error.

And finally, some pictures. I’ve highlighted how the error signal propagates through the network, and how the previous errors contributed to the current error:     