Delta Equation

Description

This definition is used to compute how the loss is affected by a change in the potential of a neuron in a given layer. It is used in the backward pass of backpropagation, where the gradient is computed for updating the model parameters. The error term is computed differently depending on whether the neuron is a hidden neuron or an output neuron.

\[\htmlClass{sdt-0000000075}{\delta}_{\htmlClass{sdt-0000000018}{i}}^k\dot{=}\frac{\partial \htmlClass{sdt-0000000072}{L}\left(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000083}{\theta}}(\htmlClass{sdt-0000000103}{u}),\htmlClass{sdt-0000000037}{y}\right)}{\partial \htmlClass{sdt-0000000099}{a}_{\htmlClass{sdt-0000000018}{i}}^k}\]
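To make the definition concrete, here is a minimal numerical sketch (an illustration added here, not taken from the source): it estimates \(\delta_i^k\) for a hidden neuron in a tiny two-layer network by perturbing that neuron's potential and measuring the change in the loss with a central finite difference. The architecture, the sigmoid nonlinearity, the binary cross-entropy loss, and all names (`loss_with_nudge`, `W1`, `W2`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(y_hat, y):
    # Binary cross-entropy loss L(y_hat, y)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # weights into a 3-neuron hidden layer (assumed shape)
W2 = rng.normal(size=(1, 3))  # weights into a single output neuron

def loss_with_nudge(u, y, i, eps):
    """Forward pass in which the potential of hidden neuron i is shifted by eps."""
    a1 = W1 @ u               # potentials of the hidden layer
    a1[i] += eps              # perturb one potential to probe dL/da_i
    x1 = sigmoid(a1)          # hidden activations
    a2 = W2 @ x1              # potential of the output neuron
    y_hat = sigmoid(a2[0])    # model prediction N_theta(u)
    return bce(y_hat, y)

u = np.array([0.5, -1.2])     # model input u
y = 1.0                       # ground truth y
eps = 1e-6

# Central-difference estimate of delta_0 = dL/da_0 for hidden neuron 0:
delta_0 = (loss_with_nudge(u, y, 0, +eps) - loss_with_nudge(u, y, 0, -eps)) / (2 * eps)
print(delta_0)
```

The same perturbation applied to the output neuron's potential would recover the output-layer error term directly, which is the case worked out in the example below.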

Symbols Used:

\( \mathcal{N} \)

This is the symbol used for a function approximator, typically a neural network.

\( i \)

This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\( y \)

This symbol stands for the ground truth of a sample. In supervised learning this is often paired with the corresponding input.

\( L \)

This is the symbol for a loss function. It is a function that quantifies how far a model's prediction deviates from the target output.

\( \delta \)

This is the error of a neuron in a feedforward neural network.

\( \theta \)

This symbol represents the parameters of the model.

\( a \)

This is the potential of a neuron in a layer of a feedforward neural network.

\( u \)

This symbol denotes the input of a model.

Example

  1. Let \(\hat{y}\) denote the model prediction \(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000083}{\theta}}(\htmlClass{sdt-0000000103}{u})\). Suppose that we have a binary classification task with a single output neuron. We use the binary cross-entropy loss, which is given by \[\htmlClass{sdt-0000000072}{L}\left(\hat{y},\htmlClass{sdt-0000000037}{y}\right)=-(\htmlClass{sdt-0000000037}{y}\log(\hat{y})+(1-\htmlClass{sdt-0000000037}{y})\log(1-\hat{y}))\]
  2. Suppose that we compute the error term for the output neuron, which is the first neuron in the \(k\)-th layer, assuming that there are \(k\) layers. Then, the potential of the output neuron is equal to the model prediction, so \(\htmlClass{sdt-0000000099}{a}_1^k=\hat{y}\). That is, we are interested in the quantity \[\htmlClass{sdt-0000000075}{\delta}_1^k=\frac{\partial \htmlClass{sdt-0000000072}{L}\left(\hat{y},\htmlClass{sdt-0000000037}{y}\right)}{\partial \htmlClass{sdt-0000000099}{a}_1^k}=-\frac{\partial}{\partial \hat{y}} \left(\htmlClass{sdt-0000000037}{y}\log(\hat{y})+(1-\htmlClass{sdt-0000000037}{y})\log(1-\hat{y})\right).\]
  3. Applying the sum rule gives \[\htmlClass{sdt-0000000075}{\delta}_1^k=-\left(\frac{\partial}{\partial \hat{y}}\htmlClass{sdt-0000000037}{y}\log(\hat{y})+\frac{\partial}{\partial \hat{y}}(1-\htmlClass{sdt-0000000037}{y})\log(1-\hat{y})\right).\]
  4. Using the derivative of the logarithm and the chain rule gives \[\htmlClass{sdt-0000000075}{\delta}_1^k=-\left(\frac{\htmlClass{sdt-0000000037}{y}}{\hat{y}}-\frac{1-\htmlClass{sdt-0000000037}{y}}{1-\hat{y}}\right)=\frac{1-\htmlClass{sdt-0000000037}{y}}{1-\hat{y}}-\frac{\htmlClass{sdt-0000000037}{y}}{\hat{y}}.\]
  5. Since \(\htmlClass{sdt-0000000037}{y}\in\{0,1\}\), we can also represent \(\htmlClass{sdt-0000000075}{\delta}_1^k\) as a piecewise function: \[\htmlClass{sdt-0000000075}{\delta}_1^k=\begin{cases}-\frac{1}{\hat{y}}&\text{if }\htmlClass{sdt-0000000037}{y}=1\\\frac{1}{1-\hat{y}}&\text{otherwise.}\end{cases}\] The closed form from step 4 is checked numerically in the sketch after this list.
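The following sketch (added for illustration; only the loss and the closed form come from the example above) compares the result of step 4 with a central-difference approximation of \(\partial L/\partial\hat{y}\) at a few points. The function names are assumptions.

```python
import numpy as np

def bce(y_hat, y):
    # Binary cross-entropy loss from step 1 of the example
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def delta_closed_form(y_hat, y):
    # Step 4 of the example: (1 - y) / (1 - y_hat) - y / y_hat
    return (1.0 - y) / (1.0 - y_hat) - y / y_hat

eps = 1e-7
for y in (0.0, 1.0):
    for y_hat in (0.2, 0.5, 0.9):
        numeric = (bce(y_hat + eps, y) - bce(y_hat - eps, y)) / (2.0 * eps)
        analytic = delta_closed_form(y_hat, y)
        print(f"y={y:.0f}, y_hat={y_hat}: numeric={numeric:.6f}, analytic={analytic:.6f}")
```

For \(y=1\) the printed values match \(-1/\hat{y}\), and for \(y=0\) they match \(1/(1-\hat{y})\), in agreement with the piecewise form in step 5.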
