Backpropagation is one of the most popular methods for training multilayer
artificial neural networks (**ANN**s). ANNs classify data, and can be thought of
as function approximators. Given enough hidden neurons and a nonlinear
activation function, a multilayer ANN can approximate any continuous function
on a bounded input region to arbitrary accuracy.

A two-layered fully-connected ANN can look like this:

Where are individual neurons. A neuron can have multiple weighted inputs, and connects to the neurons in the next layer. Additionally, we add a bias: a value that doesn’t depend on the neuron’s input. Here you can see a schematic representation of a single neuron.

The here is our bias value, and is a function called the
*activation function*. Thus, the output of a single neuron .

The simplest activation function is the identity function. With a threshold activation function, a single neuron divides the space into two parts, which makes it a primitive classifier.

A single neuron can only separate the space using a single plane/line – the two data classes must be linearly separable.
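To make the single-neuron classifier concrete, here is a minimal sketch of a neuron with a threshold activation. The function names, the weights, and the bias are my own illustrative choices (they are not taken from the figures above); the values are picked by hand so that the neuron implements logical AND, which is a linearly separable problem.

```python
def threshold(z):
    """Threshold (step) activation: 1 if the weighted input is non-negative, else 0."""
    return 1 if z >= 0 else 0


def neuron(inputs, weights, bias, activation=threshold):
    """Output of one neuron: activation(sum_i w_i * x_i + bias)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical hand-picked parameters: the line x1 + x2 - 1.5 = 0
# separates the point (1, 1) from the other three corners of the unit square.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

and_outputs = [neuron([a, b], AND_WEIGHTS, AND_BIAS) for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```

No choice of two weights and one bias does the same for XOR, because its two classes cannot be split by a single line.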

Using two layers and a nonlinear activation function is what makes a multilayered neural network work – it lets us map a vector from and contract/expand it into where it is now (hopefully) linearly separable.
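A small sketch of that remapping, assuming threshold activations and hand-picked (not learned) weights: the hidden layer computes OR and AND of the two inputs, and in that new space the XOR classes become linearly separable, so a single output neuron can finish the job. All names and parameter values are illustrative assumptions.

```python
def step(z):
    """Threshold activation used by every neuron in this sketch."""
    return 1 if z >= 0 else 0


def hidden_layer(x1, x2):
    """Map the input point to a new 2-D point (OR, AND)."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return h_or, h_and


def xor_network(x1, x2):
    """Output neuron: fires when OR is on and AND is off."""
    h_or, h_and = hidden_layer(x1, x2)
    return step(h_or - h_and - 0.5)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
hidden_points = [hidden_layer(a, b) for a, b in points]
xor_outputs = [xor_network(a, b) for a, b in points]
print(hidden_points)  # [(0, 0), (1, 0), (1, 0), (1, 1)]
print(xor_outputs)    # [0, 1, 1, 0]
```

In the hidden space both positive samples land on the same point (1, 0), which a single line separates from (0, 0) and (1, 1).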

The output of our hypothetical network is calculated as:

Where is the output of the -th first layer neuron:

is the weight between nodes and , and is the bias of the -th neuron in the -th layer.
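The forward computation can be sketched in a few lines. This is only an illustration under my own assumptions: a sigmoid activation, and a `weights[j][i]` layout meaning "weight from input `i` to neuron `j` of the layer". Neither the names nor the layout come from the formulas above.

```python
import math


def sigmoid(z):
    """Logistic activation function (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Output of every neuron in one layer: f(sum_i w_ji * x_i + b_j)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def network_forward(x, hidden_w, hidden_b, output_w, output_b):
    """Two-layer network: the hidden layer output feeds the output layer."""
    hidden = layer_forward(x, hidden_w, hidden_b)
    return layer_forward(hidden, output_w, output_b)


# Hypothetical 2-input, 2-hidden, 1-output network with all parameters set to
# zero: every weighted sum is 0, so the output is sigmoid(0) = 0.5.
zero_output = network_forward(
    [0.3, -0.7],
    hidden_w=[[0.0, 0.0], [0.0, 0.0]], hidden_b=[0.0, 0.0],
    output_w=[[0.0, 0.0]], output_b=[0.0],
)
print(zero_output)  # [0.5]
```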

Let’s suppose you want to teach this network; it has inputs and
outputs. To teach the network you must provide a *teacher set* with
samples. The teacher set consists of known **correct** input/output values.

Training the network should minimize the difference between the correct outputs in the teacher set, and the output of the network. We will try to minimize the root mean square (RMS) error.

For a given teacher set, the only thing that determines the output is the weights between the nodes of the network, so the error will be a function of the weights:

Where is the -th output of the -th output neuron. Using this definition of error we will derive the backpropagation formula for batch learning (on-line learning would be without the sum over every sample). The term will come in handy during derivation.
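As a rough sketch of a batch error, the function below sums the squared differences over every sample and every output neuron and multiplies by one half. The exact form is an assumption on my part: I am guessing that the handy term mentioned above is this factor of one half, which cancels the 2 that appears when the square is differentiated. The function and variable names are hypothetical.

```python
def batch_error(network_outputs, teacher_outputs):
    """Half the sum of squared differences over all samples and output neurons.

    Both arguments are lists of samples; each sample is a list with one value
    per output neuron. The 0.5 factor is an assumed convention.
    """
    total = 0.0
    for y_sample, d_sample in zip(network_outputs, teacher_outputs):
        for y, d in zip(y_sample, d_sample):
            total += (y - d) ** 2
    return 0.5 * total


# Two samples, two output neurons each (made-up numbers).
outputs = [[1.0, 0.0], [0.0, 0.0]]
targets = [[0.0, 0.0], [0.0, 2.0]]
example_error = batch_error(outputs, targets)
print(example_error)  # 2.5
```

For on-line learning the same function would be called with a single sample at a time.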

To simplify bookkeeping a little we will switch to a simplified network with only one output. Because the weights between the hidden layer and output layer don’t depend on each other we can do that without any loss of generality.

The weights between the input layer and hidden layer will be labeled as follows:

And the weights between the hidden and output layer are simply:

Because we have one output the error can now be simplified as:

To minimize the error we will update the weights of the network so that the expected error in the next iteration will be lower (gradient descent).

Where is the iteration “counter”, and is a learning rate .
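Gradient descent itself is independent of neural networks, so here is a minimal one-dimensional sketch. The error function `E(w) = (w - 3)^2`, the starting point, and the learning rate are all made up for illustration; only the update pattern (new weight = old weight minus learning rate times gradient) is the point.

```python
def gradient_descent(gradient, w0, learning_rate, iterations):
    """Repeatedly step against the gradient and record every iterate."""
    w = w0
    history = [w]
    for _ in range(iterations):
        w = w - learning_rate * gradient(w)
        history.append(w)
    return history


# Hypothetical error E(w) = (w - 3)^2 with derivative dE/dw = 2 * (w - 3).
trajectory = gradient_descent(
    gradient=lambda w: 2.0 * (w - 3.0),
    w0=0.0,
    learning_rate=0.1,
    iterations=50,
)
print(trajectory[0], trajectory[-1])  # starts at 0.0, ends close to the minimum at 3
```

With this learning rate the distance to the minimum shrinks by a constant factor each iteration; a rate that is too large would make the iterates overshoot and diverge.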

The weight update formula for is:

The error:

So let’s proceed to differentiate the error function.

Let’s skip writing out the iteration index to declutter the notation a bit.

Using the chain rule:

The and terms cancel each other out leaving:

The known outputs don’t depend on the input weights, and are essentially constants. W.r.t their derivative is zero, leaving:

Where is the input to the output layer neuron, is the neuron input (what goes to the activation function) with respect to the -th training input. will be the input of the -th neuron in the hidden layer:

Where is the bias of the output neuron. If we use the chain rule once again we will have:

Clearly, the only term in the expression that depends on is . The rest of the sum will zero out after differentiation. Therefore, we arrive at:

And the full formula for the weight update:
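A derived gradient is easy to get wrong, so it is worth checking numerically. The sketch below assumes a sigmoid activation, a single output neuron, and a half-sum-of-squares error; under those assumptions the gradient for an output-layer weight is the sum over samples of (output minus target) times the activation derivative times the hidden neuron's output. All names (`W`, `v`, `b_hidden`, `b_out`) and all numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b_hidden, v, b_out):
    """Return the hidden outputs h and the network output y for one sample."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, b_hidden)]
    y = sigmoid(sum(vj * hj for vj, hj in zip(v, h)) + b_out)
    return h, y


def error(samples, W, b_hidden, v, b_out):
    """Assumed batch error: half the sum of squared differences."""
    return 0.5 * sum((forward(x, W, b_hidden, v, b_out)[1] - d) ** 2 for x, d in samples)


def output_weight_gradient(samples, W, b_hidden, v, b_out, j):
    """Analytic dE/dv_j = sum over samples of (y - d) * f'(z) * h_j."""
    grad = 0.0
    for x, d in samples:
        h, y = forward(x, W, b_hidden, v, b_out)
        grad += (y - d) * y * (1.0 - y) * h[j]  # for a sigmoid, f'(z) = y * (1 - y)
    return grad


samples = [([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0), ([1.0, 0.0], 1.0)]
W = [[0.5, -0.4], [0.3, 0.8]]
b_hidden = [0.1, -0.2]
v = [0.7, -0.6]
b_out = 0.05

analytic = output_weight_gradient(samples, W, b_hidden, v, b_out, j=0)

# Central finite difference on v[0] as an independent check.
eps = 1e-6
e_plus = error(samples, W, b_hidden, [v[0] + eps, v[1]], b_out)
e_minus = error(samples, W, b_hidden, [v[0] - eps, v[1]], b_out)
numeric = (e_plus - e_minus) / (2 * eps)
print(analytic, numeric)  # the two values should agree to several decimal places
```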

Derivations of are similar and I’ll leave them out.

Most of this derivation is similar to the output layer case written before; I’ll start commenting when we get to the differences. I’ll only do the derivation for ; the others can be done in a similar fashion.

Let’s drop the unwieldy indices as before and continue:

Here it gets interesting: we must find the derivative of w.r.t. .

Using the chain rule:

The only part of the sum that depends on is , continuing:

Using the chain rule:

In the sum: the only term with a non-zero derivative w.r.t is

Finally, the weight update rule for the hidden layer:
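The same numerical check works for a hidden-layer weight. Under my assumptions (sigmoid activations, one output neuron, half-sum-of-squares error) the chain rule adds two factors compared with the output layer: the output weight that connects the hidden neuron to the output, and the derivative of the hidden neuron's own activation. Names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b_hidden, v, b_out):
    """Return the hidden outputs h and the network output y for one sample."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, b_hidden)]
    y = sigmoid(sum(vj * hj for vj, hj in zip(v, h)) + b_out)
    return h, y


def error(samples, W, b_hidden, v, b_out):
    return 0.5 * sum((forward(x, W, b_hidden, v, b_out)[1] - d) ** 2 for x, d in samples)


def hidden_weight_gradient(samples, W, b_hidden, v, b_out, j, i):
    """Analytic dE/dW[j][i]: output-layer term, times v_j, times f'(hidden input), times x_i."""
    grad = 0.0
    for x, d in samples:
        h, y = forward(x, W, b_hidden, v, b_out)
        output_term = (y - d) * y * (1.0 - y)
        grad += output_term * v[j] * h[j] * (1.0 - h[j]) * x[i]
    return grad


samples = [([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0), ([1.0, 0.0], 1.0)]
W = [[0.5, -0.4], [0.3, 0.8]]
b_hidden = [0.1, -0.2]
v = [0.7, -0.6]
b_out = 0.05

analytic = hidden_weight_gradient(samples, W, b_hidden, v, b_out, j=1, i=0)

# Central finite difference on W[1][0], using copies so W itself is untouched.
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[1][0] += eps
W_minus[1][0] -= eps
numeric = (error(samples, W_plus, b_hidden, v, b_out)
           - error(samples, W_minus, b_hidden, v, b_out)) / (2 * eps)
print(analytic, numeric)  # the two values should agree to several decimal places
```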

Deriving the update for neuron biases is mostly the same as for regular weights; the only difference is in the last couple of steps:

Similarly, for the hidden layer:

To sum up all we’ve done so far, the weight update rules for the output layer:

And for the hidden layer:

And finally, biases:
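Put together, the update rules give a complete batch training step. The sketch below is one possible implementation under my own assumptions: sigmoid activations, a single output, half-sum-of-squares error, XOR as the teacher set, and a learning rate and random seed picked arbitrarily. A bias gradient is the same as a weight gradient with the input replaced by 1.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W, b_h, v, b_o):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, b_h)]
    y = sigmoid(sum(vj * hj for vj, hj in zip(v, h)) + b_o)
    return h, y


def batch_error(samples, W, b_h, v, b_o):
    return 0.5 * sum((forward(x, W, b_h, v, b_o)[1] - d) ** 2 for x, d in samples)


def train_step(samples, W, b_h, v, b_o, lr):
    """One batch update: accumulate gradients over all samples, then step once."""
    gW = [[0.0] * len(row) for row in W]
    gb_h = [0.0] * len(b_h)
    gv = [0.0] * len(v)
    gb_o = 0.0
    for x, d in samples:
        h, y = forward(x, W, b_h, v, b_o)
        out_term = (y - d) * y * (1.0 - y)        # output layer: error times f'
        gb_o += out_term                           # bias: the "input" is 1
        for j, hj in enumerate(h):
            gv[j] += out_term * hj
            hid_term = out_term * v[j] * hj * (1.0 - hj)
            gb_h[j] += hid_term
            for i, xi in enumerate(x):
                gW[j][i] += hid_term * xi
    W = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W, gW)]
    b_h = [b - lr * g for b, g in zip(b_h, gb_h)]
    v = [vj - lr * g for vj, g in zip(v, gv)]
    b_o = b_o - lr * gb_o
    return W, b_h, v, b_o


# XOR teacher set and small random initial parameters (seed chosen arbitrarily).
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
rng = random.Random(0)
W = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(2)]
b_h = [rng.uniform(-1.0, 1.0) for _ in range(2)]
v = [rng.uniform(-1.0, 1.0) for _ in range(2)]
b_o = rng.uniform(-1.0, 1.0)

initial_error = batch_error(samples, W, b_h, v, b_o)
for _ in range(2000):
    W, b_h, v, b_o = train_step(samples, W, b_h, v, b_o, lr=0.5)
final_error = batch_error(samples, W, b_h, v, b_o)
print(initial_error, final_error)
```

Whether such a small network fully learns XOR depends on the initial weights; the sketch is only meant to show the batch error going down as the updates are applied.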

Why the weight update algorithm is called the **backpropagation** algorithm may
not be so apparent when considering the usual case of a network with two layers.
For it to become more obvious, we will consider a simple network with three
layers:

The weight update for the output layer is similar to before:

For the second hidden layer:

Things start to get interesting in the first hidden layer. Let’s analyze the weight update for :

I’ll do the relevant partials separately:

The derivative of the bias is, of course, zero:

The first term:

The derivative of and the output of the -nd neuron are 0 w.r.t. , so:

The second part is:

Putting it together we get:

The term is sometimes written as and is called the error signal. If we rewrite the weight update rules for the output layer to use it we will get:

For the second hidden layer:

Writing in the weight update rule for :

We can notice that some elements from the previous layers showed up:

To unify this we have to think in terms of the error signal. One way to picture it: the error from the previous layer ripples down, but it has to be scaled in proportion to how much the previous layer influenced it.

The weight gradient is then simply the error introduced to
the output **times** the gradient of the activation function at the current neuron’s
input **times** this neuron’s input. Recolored:

- error introduced.
- gradient.
- input.

So for the second hidden layer:

Does it hold for the first hidden layer?

I’ll try to tidy up the expression in parentheses:

Substituting it back:

Which is the same term we came up with previously, when we derived it explicitly.
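The error-signal view generalizes to any number of layers, which a short sketch can demonstrate. Everything here is an assumption for illustration: sigmoid activations, a half-sum-of-squares error on one sample, a `(weights, biases)` tuple per layer, and made-up parameter values. Each layer's error signal is the weighted sum of the next layer's signals times the local activation derivative, and a weight gradient is that signal times the weight's input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Return the activations of every layer, with the input itself first."""
    activations = [x]
    for weights, biases in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a for w, a in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return activations


def error_signals(activations, layers, target):
    """Propagate the error signal from the output layer back to the first layer."""
    n = len(layers)
    deltas = [None] * n
    out = activations[-1]
    deltas[-1] = [(y - d) * y * (1.0 - y) for y, d in zip(out, target)]
    for k in range(n - 2, -1, -1):
        next_weights = layers[k + 1][0]
        deltas[k] = [
            sum(next_weights[m][j] * deltas[k + 1][m] for m in range(len(next_weights)))
            * a * (1.0 - a)
            for j, a in enumerate(activations[k + 1])
        ]
    return deltas


def sample_error(x, layers, target):
    out = forward(x, layers)[-1]
    return 0.5 * sum((y - d) ** 2 for y, d in zip(out, target))


# Hypothetical three-layer network: 2 inputs, two hidden layers of 2, 1 output.
layers = [
    ([[0.2, -0.5], [0.7, 0.1]], [0.05, -0.1]),
    ([[-0.3, 0.8], [0.6, -0.2]], [0.0, 0.2]),
    ([[0.9, -0.4]], [0.1]),
]
x, target = [1.0, 0.5], [1.0]

acts = forward(x, layers)
deltas = error_signals(acts, layers, target)
analytic = deltas[0][0] * acts[0][0]  # gradient of the first weight in the first layer

# Central finite difference on that same weight, restored afterwards.
eps = 1e-6
original = layers[0][0][0][0]
layers[0][0][0][0] = original + eps
e_plus = sample_error(x, layers, target)
layers[0][0][0][0] = original - eps
e_minus = sample_error(x, layers, target)
layers[0][0][0][0] = original
numeric = (e_plus - e_minus) / (2 * eps)
print(analytic, numeric)  # the two values should agree to several decimal places
```

The backward loop touches each layer once, which is why reusing the error signals is cheaper than deriving every weight's gradient from scratch.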

To sum up:

- The gist of the weight update algorithm is error backpropagation.
- The error signal “travels” from the output layer to the input layer.
- The weights influence the error by some degree, and that weight must be taken into account when propagating the error.

And finally, some pictures. I’ve highlighted how the error signal propagates through the network, and how the previous errors contributed to the current error: