Derivation

Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is (the $\tfrac{1}{2}$ term is added to cancel the exponent when differentiating):

$$E = \tfrac{1}{2}(t - y)^2,$$
$E$ = the squared error
$t$ = target output
$y$ = actual output of the output neuron[note 1]

Therefore the error, $E$, depends on the output $y$. However, the output $y$ depends on the weighted sum of all its inputs:

$$y = \sum_{i=1}^{n} w_i x_i$$
$n$ = the number of input units to the neuron
$w_i$ = the $i$th weight
$x_i$ = the $i$th input value to the neuron

The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of its inputs). In general, a non-linear, differentiable activation function, $\varphi$, is used. Thus, more correctly:

$$\mathrm{net} = \sum_{i=1}^{n} w_i x_i$$
$$y = \varphi(\mathrm{net})$$
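As a concrete illustration, the forward pass of such a neuron takes only a few lines of Python. This is a minimal sketch; the function name neuron_output, the tanh activation, and the example numbers are illustrative choices, not part of the derivation:

```python
import math

def neuron_output(weights, inputs, phi):
    """Compute y = phi(net), where net is the weighted sum of the inputs."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return phi(net)

# Example with tanh as the non-linear, differentiable activation (an arbitrary choice):
print(neuron_output([0.5, -0.3], [1.0, 2.0], math.tanh))  # tanh(0.5*1.0 - 0.3*2.0) = tanh(-0.1)
```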

This lays the groundwork for calculating the partial derivative of the error with respect to a weight $w_i$ using the chain rule:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \mathrm{net}} \cdot \frac{\partial \mathrm{net}}{\partial w_i}$$
$\frac{\partial E}{\partial w_i}$ = How the error changes when the weights are changed
$\frac{\partial E}{\partial y}$ = How the error changes when the output is changed
$\frac{\partial y}{\partial \mathrm{net}}$ = How the output changes when the weighted sum changes
$\frac{\partial \mathrm{net}}{\partial w_i}$ = How the weighted sum changes as the weights change

Since the weighted sum $\mathrm{net}$ is just the sum over all products $w_i x_i$, the partial derivative of the sum with respect to a weight $w_i$ is just the corresponding input $x_i$. Similarly, the partial derivative of the sum with respect to an input value $x_i$ is just the weight $w_i$:

$$\frac{\partial \mathrm{net}}{\partial w_i} = x_i$$
$$\frac{\partial \mathrm{net}}{\partial x_i} = w_i$$
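These two partials can be checked numerically. The following is a minimal sketch; the helper net, the step size h, and the concrete numbers are all arbitrary illustrative choices:

```python
def net(w, x):
    """The weighted sum: sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

w, x, h = [0.5, -0.3], [1.0, 2.0], 1e-6
# Central finite difference in w[0]; it approximates d(net)/d(w_0) = x_0 = 1.0
dnet_dw0 = (net([w[0] + h, w[1]], x) - net([w[0] - h, w[1]], x)) / (2 * h)
print(dnet_dw0)  # ~1.0, i.e. the corresponding input value
```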

The derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is simply the derivative of the activation function $\varphi$:

$$\frac{\partial y}{\partial \mathrm{net}} = \frac{\partial \varphi(\mathrm{net})}{\partial \mathrm{net}} = \varphi'(\mathrm{net})$$

This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:

$$\varphi(z) = \frac{1}{1 + e^{-z}}$$

which has a nice derivative of:

$$\frac{d\varphi}{dz}(z) = \varphi(z)\,(1 - \varphi(z))$$
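This identity is easy to verify numerically. A minimal sketch, where the evaluation point z and the step size h are arbitrary:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

z, h = 0.7, 1e-6
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)  # central finite difference
analytic = logistic(z) * (1.0 - logistic(z))             # phi(z) * (1 - phi(z))
print(numeric, analytic)  # both ~0.2217
```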

For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is the same as the derivative of the logistic function:

$$\frac{\partial y}{\partial \mathrm{net}} = \varphi(\mathrm{net})\,(1 - \varphi(\mathrm{net})) = y\,(1 - y)$$

Finally, the derivative of the error $E$ with respect to the output $y$ is:

$$\frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\,\tfrac{1}{2}(t - y)^2$$
$$\frac{\partial E}{\partial y} = y - t$$

Putting it all together:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial \mathrm{net}} \cdot \frac{\partial \mathrm{net}}{\partial w_i}$$
$$\frac{\partial E}{\partial w_i} = (y - t)\, y\,(1 - y)\, x_i$$

If one were to use a different activation function, the only difference would be that the $y\,(1 - y)$ term would be replaced by the derivative of the newly chosen activation function.
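Put as code, the whole gradient for a single logistic neuron is only a few lines. This is a minimal sketch; the function name gradient and the example numbers are chosen purely for illustration:

```python
import math

def gradient(weights, inputs, target):
    """Return dE/dw_i = (y - t) * y * (1 - y) * x_i for every weight."""
    net = sum(w * x for w, x in zip(weights, inputs))
    y = 1.0 / (1.0 + math.exp(-net))        # logistic activation
    delta = (y - target) * y * (1.0 - y)    # (dE/dy) * (dy/dnet)
    return [delta * x for x in inputs]      # times dnet/dw_i = x_i

print(gradient([0.5, -0.3], [1.0, 2.0], 1.0))
```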

To update the weight $w_i$ using gradient descent, one must choose a learning rate, $\alpha$. The change in weight after learning would then be the product of the learning rate and the gradient, multiplied by $-1$ so that the weight moves toward a minimum rather than a maximum of the error:

$$\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}$$
$$\Delta w_i = \alpha\,(t - y)\, y\,(1 - y)\, x_i$$

For a linear neuron, the derivative of the activation function $\varphi$ is 1, which yields:

$$\Delta w_i = \alpha\,(t - y)\, x_i$$

This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule. In both backpropagation and perceptron learning, when the output $y$ matches the desired output $t$, the change in weight $\Delta w_i$ is zero, which is exactly what is desired.
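A minimal training loop built on this update rule might look as follows. The sketch assumes a single logistic neuron; the toy input, target, and learning rate are arbitrary:

```python
import math

def train_step(weights, inputs, target, alpha):
    """One gradient descent step: w_i <- w_i + alpha * (t - y) * y * (1 - y) * x_i."""
    net = sum(w * x for w, x in zip(weights, inputs))
    y = 1.0 / (1.0 + math.exp(-net))
    delta = (target - y) * y * (1.0 - y)
    return [w + alpha * delta * x for w, x in zip(weights, inputs)]

weights = [0.1, -0.2]
for _ in range(1000):
    weights = train_step(weights, [1.0, 2.0], 0.9, alpha=0.5)
# As y approaches the target 0.9, delta shrinks toward zero and the weights stabilize.
```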

Limitations and Improvements

The result may converge to a local minimum

The "hill climbing" strategy of gradient descent is guaranteed to work only if the error surface has a single minimum. Often, however, the error surface has many local minima and maxima. If the starting point of the gradient descent happens to lie between a local maximum and a local minimum, then following the direction of the most negative gradient will lead to the local minimum.

[Figure: Gradient descent can find the local minimum instead of the global minimum]
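The phenomenon is easy to reproduce in one dimension. A minimal sketch, where the double-well function f(x) = x^4 - 3x^2 + x, the starting point, and the learning rate are all arbitrary illustrative choices:

```python
def grad(x):
    """Derivative of f(x) = x**4 - 3*x**2 + x, which has two minima."""
    return 4 * x**3 - 6 * x + 1

x = 2.0                  # starting point on the slope above the shallower valley
for _ in range(200):
    x -= 0.01 * grad(x)  # plain gradient descent
print(x)  # converges to the local minimum near x = 1.13, not the global one near x = -1.30
```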

Learning is slow when the error surface is elongated

Solution: Scale the inputs to have zero mean over the training set

Consider the following training examples:

(101, 101) -> 2
(101, 99) -> 0

Because both inputs are large and nearly identical, the two weights have almost the same effect on the output, so the error surface forms a long, narrow ravine along which gradient descent makes very slow progress. Subtracting the mean of each input component over the training set (101 and 100, respectively) transforms the examples into (0, 1) -> 2 and (0, -1) -> 0, and the ravine disappears; a sketch of this preprocessing follows.
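A minimal sketch of the centering step (the variable names are arbitrary):

```python
examples = [((101.0, 101.0), 2.0), ((101.0, 99.0), 0.0)]

# Mean of each input component over the training set: [101.0, 100.0]
n = len(examples)
means = [sum(x[i] for x, _ in examples) / n for i in range(len(examples[0][0]))]

# Centering yields (0, 1) -> 2 and (0, -1) -> 0: the inputs are no longer
# huge and nearly identical, so the error surface is no longer an elongated ravine.
centered = [(tuple(v - m for v, m in zip(x, means)), t) for x, t in examples]
print(centered)  # [((0.0, 1.0), 2.0), ((0.0, -1.0), 0.0)]
```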

Notes

  1. ^ There can be multiple output neurons; however, backpropagation treats each one in isolation when calculating the gradient. Therefore, in the rest of the derivation, only one output neuron is considered.
