Derivation

Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is (the $\tfrac{1}{2}$ term is added to cancel the exponent when differentiating):

$$E = \tfrac{1}{2}(t - y)^2,$$
$E$ = the squared error
$t$ = target output
$y$ = actual output of the output neuron[note 1]

Therefore the error, $E$, depends on the output $y$. However, the output $y$ depends on the weighted sum of all its inputs:

$$y = \sum_{i=1}^{n} w_i x_i$$
$n$ = the number of input units to the neuron
$w_i$ = the $i$th weight
$x_i$ = the $i$th input value to the neuron

The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of its inputs). In general, a non-linear, differentiable activation function, $\varphi$, is used. Thus, more correctly:

$$\mathrm{net} = \sum_{i=1}^{n} w_i x_i$$
$$y = \varphi(\mathrm{net})$$
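As a concrete illustration, the forward pass of such a neuron takes only a few lines of Python. This is a minimal sketch; the function name neuron_output, the tanh activation, and the example numbers are illustrative choices, not part of the derivation:

```python
import math

def neuron_output(weights, inputs, phi):
    """Compute y = phi(net), where net is the weighted sum of the inputs."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return phi(net)

# Example with tanh as the non-linear, differentiable activation (an arbitrary choice):
print(neuron_output([0.5, -0.3], [1.0, 2.0], math.tanh))  # tanh(0.5*1.0 - 0.3*2.0) = tanh(-0.1)
```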

This lays the groundwork for calculating the partial derivative of the error with respect to a weight $w_i$ using the chain rule:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \mathrm{net}} \cdot \frac{\partial \mathrm{net}}{\partial w_i}$$
$\frac{\partial E}{\partial w_i}$ = How the error changes when the weights are changed
$\frac{\partial E}{\partial y}$ = How the error changes when the output is changed
$\frac{\partial y}{\partial \mathrm{net}}$ = How the output changes when the weighted sum changes
$\frac{\partial \mathrm{net}}{\partial w_i}$ = How the weighted sum changes as the weights change

Since the weighted sum $\mathrm{net}$ is just the sum over all products $w_i x_i$, the partial derivative of the sum with respect to a weight $w_i$ is just the corresponding input $x_i$. Similarly, the partial derivative of the sum with respect to an input value $x_i$ is just the weight $w_i$:

$$\frac{\partial \mathrm{net}}{\partial w_i} = x_i$$
$$\frac{\partial \mathrm{net}}{\partial x_i} = w_i$$
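These two partials can be checked numerically. The following is a minimal sketch; the helper net, the step size h, and the concrete numbers are all arbitrary illustrative choices:

```python
def net(w, x):
    """The weighted sum: sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

w, x, h = [0.5, -0.3], [1.0, 2.0], 1e-6
# Central finite difference in w[0]; it approximates d(net)/d(w_0) = x_0 = 1.0
dnet_dw0 = (net([w[0] + h, w[1]], x) - net([w[0] - h, w[1]], x)) / (2 * h)
print(dnet_dw0)  # ~1.0, i.e. the corresponding input value
```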

The derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is simply the derivative of the activation function $\varphi$:

$$\frac{\partial y}{\partial \mathrm{net}} = \frac{\partial \varphi(\mathrm{net})}{\partial \mathrm{net}} = \varphi'(\mathrm{net})$$

This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:

$$\varphi(z) = \frac{1}{1 + e^{-z}}$$

which has a nice derivative of:

$$\frac{d\varphi}{dz}(z) = \varphi(z)\,(1 - \varphi(z))$$
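This identity is easy to verify numerically. A minimal sketch, where the evaluation point z and the step size h are arbitrary:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z, h = 0.7, 1e-6
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)  # central finite difference
analytic = logistic(z) * (1.0 - logistic(z))             # phi(z) * (1 - phi(z))
print(numeric, analytic)  # both ~0.2217
```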

For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is the same as the derivative of the logistic function:

$$\frac{\partial y}{\partial \mathrm{net}} = \varphi(\mathrm{net})\,(1 - \varphi(\mathrm{net})) = y\,(1 - y)$$

Finally, the derivative of the error $E$ with respect to the output $y$ is:

$$\frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\,\tfrac{1}{2}(t - y)^2$$
$$\frac{\partial E}{\partial y} = y - t$$

Putting it all together:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \mathrm{net}} \cdot \frac{\partial \mathrm{net}}{\partial w_i}$$
$$\frac{\partial E}{\partial w_i} = (y - t)\, y\,(1 - y)\, x_i$$

If one were to use a different activation function, the only difference would be that the $y\,(1 - y)$ term would be replaced by the derivative of the newly chosen activation function.
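Put as code, the whole gradient for a single logistic neuron is only a few lines. This is a minimal sketch; the function name gradient and the example numbers are chosen purely for illustration:

```python
import math

def gradient(weights, inputs, target):
    """Return dE/dw_i = (y - t) * y * (1 - y) * x_i for every weight."""
    net = sum(w * x for w, x in zip(weights, inputs))
    y = 1.0 / (1.0 + math.exp(-net))        # logistic activation
    delta = (y - target) * y * (1.0 - y)    # (dE/dy) * (dy/dnet)
    return [delta * x for x in inputs]      # times dnet/dw_i = x_i

print(gradient([0.5, -0.3], [1.0, 2.0], 1.0))
```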

To update the weight $w_i$ using gradient descent, one must choose a learning rate, $\alpha$. The change in weight after learning would then be the product of the learning rate and the gradient, multiplied by $-1$ so that the weight moves toward a minimum rather than a maximum of the error:

$$\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}$$
$$\Delta w_i = \alpha\,(t - y)\, y\,(1 - y)\, x_i$$

For a linear neuron, the derivative of the activation function $\varphi$ is 1, which yields:

$$\Delta w_i = \alpha\,(t - y)\, x_i$$

This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule. In both backpropagation and perceptron learning, when the output $y$ matches the desired output $t$, the change in weight $\Delta w_i$ is zero, which is exactly what is desired.
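A minimal training loop built on this update rule might look as follows. The sketch assumes a single logistic neuron; the toy input, target, and learning rate are arbitrary:

```python
import math

def train_step(weights, inputs, target, alpha):
    """One gradient descent step: w_i <- w_i + alpha * (t - y) * y * (1 - y) * x_i."""
    net = sum(w * x for w, x in zip(weights, inputs))
    y = 1.0 / (1.0 + math.exp(-net))
    delta = (target - y) * y * (1.0 - y)
    return [w + alpha * delta * x for w, x in zip(weights, inputs)]

weights = [0.1, -0.2]
for _ in range(1000):
    weights = train_step(weights, [1.0, 2.0], 0.9, alpha=0.5)
# As y approaches the target 0.9, delta shrinks toward zero and the weights stabilize.
```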

Limitations and Improvements

The result may converge to a local minimum

The "hill climbing" strategy of gradient descent is guaranteed to work only if the error surface has a single minimum. Often, however, the error surface has many local minima and maxima. If the starting point of the gradient descent happens to lie between a local maximum and a local minimum, then following the direction of the most negative gradient will lead to the local minimum.

[Figure: Gradient descent can find the local minimum instead of the global minimum]
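The phenomenon is easy to reproduce in one dimension. A minimal sketch, where the double-well function f(x) = x^4 - 3x^2 + x, the starting point, and the learning rate are all arbitrary illustrative choices:

```python
def grad(x):
    """Derivative of f(x) = x**4 - 3*x**2 + x, which has two minima."""
    return 4 * x**3 - 6 * x + 1

x = 2.0                  # starting point on the slope above the shallower valley
for _ in range(200):
    x -= 0.01 * grad(x)  # plain gradient descent
print(x)  # converges to the local minimum near x = 1.13, not the global one near x = -1.30
```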

Learning is slow when the error surface is elongated

Solution: Scale the inputs to have zero mean over the training set

Consider the following training examples:

(101, 101) -> 2
(101, 99) -> 0

Because both inputs are large and nearly identical, the two weights have almost the same effect on the output, so the error surface forms a long, narrow ravine along which gradient descent makes very slow progress. Subtracting the mean of each input component over the training set (101 and 100, respectively) transforms the examples into (0, 1) -> 2 and (0, -1) -> 0, and the ravine disappears; a sketch of this preprocessing follows.
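A minimal sketch of the centering step (the variable names are arbitrary):

```python
examples = [((101.0, 101.0), 2.0), ((101.0, 99.0), 0.0)]

# Mean of each input component over the training set: [101.0, 100.0]
n = len(examples)
means = [sum(x[i] for x, _ in examples) / n for i in range(len(examples[0][0]))]

# Centering yields (0, 1) -> 2 and (0, -1) -> 0: the inputs are no longer
# huge and nearly identical, so the error surface is no longer an elongated ravine.
centered = [(tuple(v - m for v, m in zip(x, means)), t) for x, t in examples]
print(centered)  # [((0.0, 1.0), 2.0), ((0.0, -1.0), 0.0)]
```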

Notes

  1. ^ There can be multiple output neurons; however, backpropagation treats each one in isolation when calculating the gradient. Therefore, in the rest of the derivation, only one output neuron is considered.
