Bartosz Witkowski - Blog.


Backpropagation is one of the most popular methods of training multilayered artificial neural networks (ANNs). ANNs classify data and can be thought of as function approximators – a multilayer ANN can approximate any continuous function.

A two-layered fully-connected ANN can look like this:

A fully connected ANN.

The $f_i$ are individual neurons. A neuron can have multiple weighted inputs, and connects to other neurons in the next layer. Additionally we add some bias – a value that doesn't depend on the neuron's input. Here you can see a schematic representation of a single neuron.

A single neuron.

The $b$ here is our bias value, and $f$ is a function called the activation function. Thus, the output of a single neuron is $y = f\left(\sum_i w_i x_i + b\right)$.

The simplest activation function is the identity function. When using a threshold activation function a single neuron divides the space into two parts – a single neuron with a threshold activation is a primitive classifier.
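To make this concrete, here is a tiny sketch (my own illustrative example, not from the article's figures) of a threshold neuron acting as a primitive classifier – with hand-picked weights it computes logical AND:

```python
# A single neuron with a threshold (step) activation acts as a
# primitive linear classifier. The weights and bias below are chosen
# by hand so the separating line is x1 + x2 = 1.5, which implements
# logical AND -- the two classes are linearly separable.

def step(x):
    # Threshold activation: fires iff the weighted input is positive.
    return 1 if x > 0 else 0

def neuron(inputs, weights, bias):
    # Output of a single neuron: f(sum_i w_i * x_i + b)
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

weights, bias = [1.0, 1.0], -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), neuron((x1, x2), weights, bias))
```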

A single neuron can only separate the space using a single plane/line – the two data classes must be linearly separable.

Using two layers and a nonlinear activation function is what makes a multilayered neural network work – it lets us map a vector from the input space and contract/expand it into a new space where it is now (hopefully) linearly separable.

The output of our hypothetical network is calculated as:

$$y_j = f\left(\sum_{i} w^{(2)}_{ij}\, h_i + b^{(2)}_j\right)$$

Where $h_i$ is the output of the $i$-th first layer neuron:

$$h_i = f\left(\sum_{k} w^{(1)}_{ki}\, x_k + b^{(1)}_i\right)$$

$w^{(l)}_{ij}$ is the weight between nodes $i$ and $j$, and $b^{(l)}_i$ is the bias of the $i$-th neuron in the $l$-th layer.
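As a sketch of how such a two-layer network computes its output – assuming a sigmoid activation and made-up weight values, both my own choices for illustration – the formulas above translate to:

```python
import math

def sigmoid(x):
    # A common nonlinear activation function (an assumption here --
    # the article keeps the activation f abstract).
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    # weights[i][j] is the weight between input node i and neuron j;
    # each neuron computes f(sum_i w_ij * x_i + b_j).
    n_out = len(biases)
    return [sigmoid(sum(weights[i][j] * x for i, x in enumerate(inputs)) + biases[j])
            for j in range(n_out)]

def forward(x, w1, b1, w2, b2):
    hidden = layer(x, w1, b1)     # first (hidden) layer outputs h_i
    return layer(hidden, w2, b2)  # second (output) layer outputs y_j

# Tiny example: 2 inputs, 2 hidden neurons, 1 output; weights made up.
w1 = [[0.5, -0.3], [0.8, 0.2]]
b1 = [0.1, -0.1]
w2 = [[1.0], [-1.0]]
b2 = [0.0]
print(forward([1.0, 0.5], w1, b1, w2, b2))
```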


Let's suppose you want to teach this network – it has $n$ inputs and $m$ outputs. To teach the network you must provide a teacher set with $S$ samples. The teacher set consists of known correct input/output values.

Training the network should minimize the difference between the correct outputs in the teacher set, and the output of the network. We will minimize the squared error (minimizing it also minimizes the root mean square – RMS – error).

The only thing that determines the output is the weights between the nodes of the network, so the error will be a function of the weights:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{s=1}^{S} \sum_{j=1}^{m} \left(d_{js} - y_{js}\right)^2$$

Where $y_{js}$ is the $s$-th output of the $j$-th output neuron, and $d_{js}$ is the corresponding known correct output from the teacher set. Using this definition of error we will derive the backpropagation formula for batch learning (on-line learning would be without the sum over every sample). The $\frac{1}{2}$ term will come in handy during derivation.
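A minimal sketch of computing this batch error over a teacher set (the sample values below are made up):

```python
def batch_error(targets, outputs):
    # E = 1/2 * sum over samples s and outputs j of (d_js - y_js)^2.
    # The 1/2 factor will cancel the 2 that appears when differentiating.
    return 0.5 * sum((d - y) ** 2
                     for ds, ys in zip(targets, outputs)
                     for d, y in zip(ds, ys))

# Two samples, one output each: known answers vs. network outputs.
targets = [[1.0], [0.0]]
outputs = [[0.8], [0.3]]
print(batch_error(targets, outputs))  # 0.5 * (0.2^2 + 0.3^2) = 0.065
```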

To simplify bookkeeping a little we will switch to a simplified network with only one output. Because the weights between the hidden layer and output layer don’t depend on each other we can do that without any loss of generality.

ANN with one output neuron.

The weights between the input layer and hidden layer will be labeled $w_{ij}$ – the weight from the $i$-th input to the $j$-th hidden neuron:

Labels for the weights between the first and hidden layer.

And the weights between the hidden and output layer are simply $v_1, \ldots, v_n$:

Labels for the weights between the hidden and output layer.

Because we have one output, the error can now be simplified as:

$$E = \frac{1}{2} \sum_{s=1}^{S} \left(d_s - y_s\right)^2$$

To minimize the error we will update the weights of the network so that the expected error in the next iteration will be lower (gradient descent):

$$w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial E}{\partial w_{ij}(t)}$$

Where $t$ is the iteration "counter", and $\eta$ is a learning rate $(0 < \eta \le 1)$.
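The update rule can be sketched on a toy one-weight "error function" (my own example, not the network's actual error), just to show the mechanics of gradient descent:

```python
# Gradient descent on a toy error function E(w) = (w - 3)^2,
# whose gradient is dE/dw = 2 * (w - 3). The same update rule,
# w(t+1) = w(t) - eta * dE/dw, is applied to every network weight.

def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
eta = 0.1      # learning rate, 0 < eta <= 1
for t in range(100):
    w = w - eta * gradient(w)

print(w)  # converges towards the minimum at w = 3
```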

Deriving the formula

Hidden layer – output layer

The weight update formula for $v_1$ is:

$$v_1(t+1) = v_1(t) - \eta\,\frac{\partial E}{\partial v_1(t)}$$

The error:

$$E = \frac{1}{2} \sum_{s=1}^{S} \left(d_s - y_s\right)^2$$

So let's proceed to differentiate the error function.

Let's skip writing out the iteration index to clear it up a bit:

$$\frac{\partial E}{\partial v_1} = \frac{\partial}{\partial v_1}\,\frac{1}{2} \sum_{s=1}^{S} \left(d_s - y_s\right)^2$$

Using the chain rule:

$$\frac{\partial E}{\partial v_1} = \frac{1}{2} \sum_{s=1}^{S} 2\,\left(d_s - y_s\right)\,\frac{\partial}{\partial v_1}\left(d_s - y_s\right)$$

The $\frac{1}{2}$ and $2$ terms cancel each other out leaving:

$$\frac{\partial E}{\partial v_1} = \sum_{s=1}^{S} \left(d_s - y_s\right)\,\frac{\partial}{\partial v_1}\left(d_s - y_s\right)$$

The known outputs $d_s$ don't depend on the input weights, and are essentially constants. W.r.t. $v_1$ their derivative is zero, leaving:

$$\frac{\partial E}{\partial v_1} = -\sum_{s=1}^{S} \left(d_s - y_s\right)\,\frac{\partial y_s}{\partial v_1}$$

Where $y_s = f(u_s)$ is the output of the output layer neuron, $u_s$ is the neuron input (what goes into the activation function) with respect to the $s$-th training input, and $h_{is}$ is the output of the $i$-th neuron in the hidden layer:

$$u_s = \sum_{i=1}^{n} v_i\, h_{is} + b$$

Where $b$ is the bias of the output neuron. If we use the chain rule once again we will have:

$$\frac{\partial y_s}{\partial v_1} = f'(u_s)\,\frac{\partial u_s}{\partial v_1}$$

Clearly, the only term in the expression for $u_s$ that depends on $v_1$ is $v_1 h_{1s}$. The rest of the sum will zero out after differentiation. Therefore, we arrive at:

$$\frac{\partial y_s}{\partial v_1} = f'(u_s)\, h_{1s}$$

And the full formula for the weight update:

$$v_1(t+1) = v_1(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)\, h_{1s}$$

Derivations for $v_2, \ldots, v_n$ are similar and I'll leave them out.
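A nice property of the derived formula is that it can be sanity-checked numerically: the analytic gradient should agree with a finite-difference estimate of the error. A sketch, assuming a sigmoid activation and made-up values (both my choices):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    # Derivative of the sigmoid: f'(u) = f(u) * (1 - f(u)).
    s = sigmoid(x)
    return s * (1.0 - s)

# One output neuron fed by hidden outputs h, with weights v and bias b.
h = [0.2, 0.9]
v = [0.5, -0.4]
b = 0.1
d = 1.0  # known correct output for this single sample

def error(weights):
    u = sum(vi * hi for vi, hi in zip(weights, h)) + b
    return 0.5 * (d - sigmoid(u)) ** 2

# Analytic gradient from the derivation: dE/dv1 = -(d - y) * f'(u) * h1
u = sum(vi * hi for vi, hi in zip(v, h)) + b
analytic = -(d - sigmoid(u)) * dsigmoid(u) * h[0]

# Numerical gradient via central finite differences.
eps = 1e-6
numeric = (error([v[0] + eps, v[1]]) - error([v[0] - eps, v[1]])) / (2 * eps)

print(analytic, numeric)  # the two should agree closely
```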

Input layer – hidden layer

Most of this derivation will be similar to the output layer case written out before; I'll start commenting when we get to the differences. I'll only do the derivation for $w_{11}$, the others can be done in a similar fashion.

Let's drop the unwieldy indices as before and continue:

$$\frac{\partial E}{\partial w_{11}} = -\sum_{s=1}^{S} \left(d_s - y_s\right)\,\frac{\partial y_s}{\partial w_{11}}$$

Here it gets interesting: we must find the derivative of $y_s$ w.r.t. $w_{11}$.

Using the chain rule:

$$\frac{\partial y_s}{\partial w_{11}} = f'(u_s)\,\frac{\partial u_s}{\partial w_{11}} = f'(u_s)\,\frac{\partial}{\partial w_{11}}\left(\sum_{i=1}^{n} v_i\, h_{is} + b\right)$$

The only part of the sum that depends on $w_{11}$ is $v_1 h_{1s}$, continuing:

$$\frac{\partial y_s}{\partial w_{11}} = f'(u_s)\, v_1\,\frac{\partial h_{1s}}{\partial w_{11}}$$

Using the chain rule, with $u_{1s} = \sum_k w_{k1}\, x_{ks} + b_1$ as the input of the first hidden neuron:

$$\frac{\partial h_{1s}}{\partial w_{11}} = f'(u_{1s})\,\frac{\partial u_{1s}}{\partial w_{11}} = f'(u_{1s})\,\frac{\partial}{\partial w_{11}}\left(\sum_{k} w_{k1}\, x_{ks} + b_1\right)$$

In the sum, the only term with a non-zero derivative w.r.t. $w_{11}$ is $w_{11} x_{1s}$, so:

$$\frac{\partial h_{1s}}{\partial w_{11}} = f'(u_{1s})\, x_{1s}$$

Finally, the weight update rule for the hidden layer:

$$w_{11}(t+1) = w_{11}(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)\, v_1\, f'(u_{1s})\, x_{1s}$$


Deriving the biases of neurons is mostly the same as for regular weights; the only difference is in the last couple of steps. For the output neuron's bias $b$ we have $\frac{\partial u_s}{\partial b} = 1$, so:

$$b(t+1) = b(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)$$

Similarly, for the hidden layer:

$$b_1(t+1) = b_1(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)\, v_1\, f'(u_{1s})$$

Putting it all together

To sum up all we've done so far, the weight update rules for the output layer:

$$v_i(t+1) = v_i(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)\, h_{is}$$

And for the hidden layer:

$$w_{ij}(t+1) = w_{ij}(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)\, v_j\, f'(u_{js})\, x_{is}$$

And finally, the biases:

$$b(t+1) = b(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)$$

$$b_j(t+1) = b_j(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_s)\, v_j\, f'(u_{js})$$
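Putting the update rules to work, here is a sketch of batch training on XOR – all concrete choices (sigmoid activation, 2 hidden neurons, learning rate, random seed) are mine, not prescribed by the derivation:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(u):
    s = sigmoid(u)
    return s * (1.0 - s)

# Teacher set: XOR, the classic problem that a single neuron cannot solve.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

random.seed(1)
n_hidden = 2
w = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(2)]  # input->hidden
bh = [random.uniform(-1, 1) for _ in range(n_hidden)]                     # hidden biases
v = [random.uniform(-1, 1) for _ in range(n_hidden)]                      # hidden->output
bo = random.uniform(-1, 1)                                                # output bias
eta = 0.5

def forward(x):
    u_hidden = [sum(w[i][j] * x[i] for i in range(2)) + bh[j] for j in range(n_hidden)]
    h = [sigmoid(u) for u in u_hidden]
    u_out = sum(v[j] * h[j] for j in range(n_hidden)) + bo
    return u_hidden, h, u_out, sigmoid(u_out)

def batch_error():
    return 0.5 * sum((d - forward(x)[3]) ** 2 for x, d in samples)

initial = batch_error()
for epoch in range(5000):
    # Accumulate the batch gradients (the sums over s from the rules above).
    gw = [[0.0] * n_hidden for _ in range(2)]
    gbh = [0.0] * n_hidden
    gv = [0.0] * n_hidden
    gbo = 0.0
    for x, d in samples:
        u_hidden, h, u_out, y = forward(x)
        delta_out = (d - y) * dsigmoid(u_out)
        for j in range(n_hidden):
            delta_j = delta_out * v[j] * dsigmoid(u_hidden[j])
            gv[j] += delta_out * h[j]
            gbh[j] += delta_j
            for i in range(2):
                gw[i][j] += delta_j * x[i]
        gbo += delta_out
    # Apply the updates after the whole batch (batch learning).
    for j in range(n_hidden):
        v[j] += eta * gv[j]
        bh[j] += eta * gbh[j]
        for i in range(2):
            w[i][j] += eta * gw[i][j]
    bo += eta * gbo

final = batch_error()
print(initial, final)  # the error should have decreased
```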

Why backpropagation?

Why the weight update algorithm is called the backpropagation algorithm may not be so apparent when considering the usual case of a network with two layers. For it to become more obvious, we will consider a simple network with three layers:

A three-layer neural network: inputs $x_1, x_2$, first hidden layer $f_1, f_2$, second hidden layer $f_3, f_4$, output neuron $f_5$. The upper weights (into $f_5$) are labeled $v_3, v_4$; the middle weights $w'_{ij}$; the lower weights $w_{ij}$.

The weight update for the output layer is similar to before:

$$v_3(t+1) = v_3(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_{5s})\, h_{3s}$$

For the second hidden layer:

$$w'_{13}(t+1) = w'_{13}(t) + \eta \sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_{5s})\, v_3\, f'(u_{3s})\, h_{1s}$$

Weight update for the first hidden layer

Things start to be interesting in the first hidden layer: let's analyze the weight update for $w_{11}$:

$$\frac{\partial E}{\partial w_{11}} = -\sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_{5s})\,\frac{\partial u_{5s}}{\partial w_{11}}$$

I'll do the relevant partials separately:

$$\frac{\partial u_{5s}}{\partial w_{11}} = v_3\,\frac{\partial h_{3s}}{\partial w_{11}} + v_4\,\frac{\partial h_{4s}}{\partial w_{11}} + \frac{\partial b_5}{\partial w_{11}}$$

The derivative of the bias is, of course, zero:

$$\frac{\partial b_5}{\partial w_{11}} = 0$$

The first term:

$$v_3\,\frac{\partial h_{3s}}{\partial w_{11}} = v_3\, f'(u_{3s})\,\frac{\partial}{\partial w_{11}}\left(w'_{13}\, h_{1s} + w'_{23}\, h_{2s} + b_3\right)$$

The derivatives of $b_3$ and of the output of the 2-nd neuron, $h_{2s}$, are 0 w.r.t. $w_{11}$, so:

$$v_3\,\frac{\partial h_{3s}}{\partial w_{11}} = v_3\, f'(u_{3s})\, w'_{13}\, f'(u_{1s})\, x_{1s}$$

The second part is:

$$v_4\,\frac{\partial h_{4s}}{\partial w_{11}} = v_4\, f'(u_{4s})\, w'_{14}\, f'(u_{1s})\, x_{1s}$$

Putting it together we get:

$$\frac{\partial E}{\partial w_{11}} = -\sum_{s=1}^{S} \left(d_s - y_s\right) f'(u_{5s}) \left(v_3\, f'(u_{3s})\, w'_{13} + v_4\, f'(u_{4s})\, w'_{14}\right) f'(u_{1s})\, x_{1s}$$

Error signal

The term $\left(d_s - y_s\right) f'(u_{5s})$ is sometimes written as $\delta_{5s}$ and is called the error signal. If we rewrite the weight update rules for the output layer to use it, we will get:

$$v_3(t+1) = v_3(t) + \eta \sum_{s=1}^{S} \delta_{5s}\, h_{3s}$$

For the second hidden layer:

$$w'_{13}(t+1) = w'_{13}(t) + \eta \sum_{s=1}^{S} \delta_{5s}\, v_3\, f'(u_{3s})\, h_{1s}$$

Writing out the weight update rule for $w_{11}$:

$$w_{11}(t+1) = w_{11}(t) + \eta \sum_{s=1}^{S} \delta_{5s} \left(v_3\, f'(u_{3s})\, w'_{13} + v_4\, f'(u_{4s})\, w'_{14}\right) f'(u_{1s})\, x_{1s}$$

Propagating the error signal

We can notice that some elements from the previous layers showed up: the factor $\delta_{5s}\, v_3\, f'(u_{3s})$ in the second hidden layer's rule plays the same role as $\delta_{5s}$ did in the output layer's rule – call it $\delta_{3s}$ (and analogously $\delta_{4s} = \delta_{5s}\, v_4\, f'(u_{4s})$).

To unify this we have to think in terms of the error signal. The intuition is that the error from the previous layer ripples down – but it has to be scaled down proportionally to how much the current neuron influenced it.

The weight gradient is then simply the error signal arriving from the layer above, times the derivative of the activation function at the current neuron's input, times this neuron's input. In general, for a neuron $j$ connected to neurons $k$ in the next layer:

$$\delta_{js} = f'(u_{js}) \sum_{k} w_{jk}\, \delta_{ks}$$

So for the second hidden layer:

$$\delta_{3s} = f'(u_{3s})\, v_3\, \delta_{5s}$$

Does it hold for the first hidden layer?

$$\delta_{1s} = f'(u_{1s}) \left(w'_{13}\, \delta_{3s} + w'_{14}\, \delta_{4s}\right)$$

I'll try to tidy up the expression in parentheses:

$$w'_{13}\, \delta_{3s} + w'_{14}\, \delta_{4s} = \delta_{5s} \left(v_3\, f'(u_{3s})\, w'_{13} + v_4\, f'(u_{4s})\, w'_{14}\right)$$

Substituting it back:

$$\delta_{1s} = \delta_{5s} \left(v_3\, f'(u_{3s})\, w'_{13} + v_4\, f'(u_{4s})\, w'_{14}\right) f'(u_{1s})$$

Which is the same term we came up with previously, when we derived it explicitly.


To sum up: at the output, $\delta_{\text{out},s} = \left(d_s - y_s\right) f'(u_{\text{out},s})$; in any earlier layer, $\delta_{js} = f'(u_{js}) \sum_k w_{jk}\, \delta_{ks}$; and every weight is updated with $w_{ij}(t+1) = w_{ij}(t) + \eta \sum_s \delta_{js}\, o_{is}$, where $o_{is}$ is the input carried by that weight.
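The equivalence shown above can also be checked numerically – computing $\delta_1$ layer by layer via the error signal recursion, and comparing it against the explicit expression derived earlier (the concrete values and the sigmoid activation are arbitrary choices of mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsig(u):
    # Derivative of the sigmoid activation.
    s = sigmoid(u)
    return s * (1.0 - s)

# Arbitrary example values for the three-layer network's
# pre-activation inputs u_i and weights.
u1, u3, u4, u5 = 0.3, -0.2, 0.7, 0.1
v3, v4 = 0.6, -0.5          # weights into the output neuron f_5
wp13, wp14 = 0.4, 0.9       # weights w'_13, w'_14 out of f_1
d, y = 1.0, sigmoid(u5)     # target and network output

# Error signals, propagated backwards layer by layer:
delta5 = (d - y) * dsig(u5)
delta3 = delta5 * v3 * dsig(u3)
delta4 = delta5 * v4 * dsig(u4)
delta1 = (wp13 * delta3 + wp14 * delta4) * dsig(u1)

# The explicit expression derived for the first hidden layer:
explicit = delta5 * (v3 * dsig(u3) * wp13 + v4 * dsig(u4) * wp14) * dsig(u1)

print(delta1, explicit)  # identical up to floating-point rounding
```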

And finally, some pictures. I’ve highlighted how the error signal propagates through the network, and how the previous errors contributed to the current error:

Error signal propagates from y to f_5.

Error signal propagates from f_5 to f_3 and f_4.

Error signal propagates from f_3 to f_1 and f_2.

Error signal propagates from f_4 to f_1 and f_2.

Error signal propagates from f_1 to x_1 and x_2.