Gradient Empirical Risk (sum of gradients)

Prerequisites

Empirical Risk of a Model | \(R^\text{emp}(h) = \frac{1}{N} \sum_{i=1}^{N} L(h(u_i), y_i)\)
Gradient Empirical Risk | \(\nabla R^\text{emp}(\mathcal{N}_{\theta^{(n)}})=\left(\frac{\partial R^\text{emp}}{\partial \theta_1}(\theta^{(n)}),\dots,\frac{\partial R^\text{emp}}{\partial \theta_{L}}(\theta^{(n)})\right)\)

Description

We can write the gradient of the empirical risk as a sum of gradients. This is also exactly how the gradient is computed in practice: by doing a sweep through the training set (an 'epoch'), we compute the gradient of the loss with respect to the parameters at each training sample and average these per-sample gradients, as in the sketch after the formula below.

\[\nabla R^\text{emp}(\mathcal{N}_{\theta})=\frac{1}{N}\sum_{i=1}^{N}\nabla L\left(\mathcal{N}_{\theta}(u_i), \mathbf{y}_i\right)\]
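As a concrete illustration of such a sweep, here is a minimal Python/NumPy sketch, assuming for simplicity a linear model \(\mathcal{N}_\theta(u) = \theta^\top u\) with squared-error loss \(L(\hat{y}, y) = (\hat{y} - y)^2\); the function and variable names are ours, not from the lecture notes:

```python
import numpy as np

# Minimal sketch: linear model N_theta(u) = theta . u with squared-error loss
# L(y_hat, y) = (y_hat - y)^2, whose gradient w.r.t. theta is 2*(y_hat - y)*u.

def empirical_risk_gradient(theta, U, Y):
    """Average the per-sample loss gradients over one sweep (epoch) of the data."""
    grad = np.zeros_like(theta)
    for u_i, y_i in zip(U, Y):                 # sweep through the training set
        y_hat = theta @ u_i                    # model output N_theta(u_i)
        grad += 2.0 * (y_hat - y_i) * u_i      # gradient of L(N_theta(u_i), y_i)
    return grad / len(U)                       # the 1/N factor from the definition

# Tiny example with three training samples in R^2
rng = np.random.default_rng(0)
theta = rng.normal(size=2)
U = rng.normal(size=(3, 2))
Y = rng.normal(size=3)
print(empirical_risk_gradient(theta, U, Y))
```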

Symbols Used:

\( \mathcal{N} \)

This is the symbol used for a function approximator, typically a neural network.

\( i \)

This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\( R \)

This symbol denotes the risk of a model.

\( \theta \)

This is the symbol we use for model weights/parameters.

\( \mathbf{y} \)

This symbol represents the output activation vector of a neural network.

\( L \)

This is the symbol for a loss function: a function that quantifies how far a model's prediction is from the target output.

\( \nabla \)

This symbol represents the gradient of a function.

\( u \)

This symbol denotes the input of a model.

Derivation

  1. Recall the definition of the empirical risk \(R^\text{emp}\) of a model \(h\) (here \(h = \mathcal{N}_\theta\)): \[R^\text{emp}(h) = \frac{1}{N} \sum_{i=1}^{N} L(h(u_i), y_i)\]
  2. Recall the definition of the gradient of the empirical risk (here the subscript \(L\) denotes the number of trainable parameters, not the loss): \[\nabla R^\text{emp}(\mathcal{N}_{\theta^{(n)}})=\left(\frac{\partial R^\text{emp}}{\partial \theta_1}(\theta^{(n)}),\dots,\frac{\partial R^\text{emp}}{\partial \theta_{L}}(\theta^{(n)})\right)\]
  3. We can plug in the definition of the empirical risk to obtain \[\nabla R^\text{emp}(\mathcal{N}_{\theta})=\nabla\left(\frac{1}{N}\sum_{i=1}^{N} L\left(\mathcal{N}_{\theta}(u_i), \mathbf{y}_i\right)\right).\]
  4. By the linearity of differentiation, the gradient of a sum is the sum of the gradients (the sum rule familiar from single-variable calculus, applied componentwise), and the constant factor \(\frac{1}{N}\) can be pulled out of the gradient. We obtain: \[\nabla R^\text{emp}(\mathcal{N}_{\theta})=\frac{1}{N}\sum_{i=1}^{N}\nabla L\left(\mathcal{N}_{\theta}(u_i), \mathbf{y}_i\right)\] as required. A quick numerical check of this identity is sketched below.
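The linearity used in step 4 is easy to verify numerically: differentiating the averaged loss gives the same vector as averaging the per-sample gradients. Below is a hedged sketch using central finite differences on the same toy linear model with squared-error loss; all names are illustrative assumptions:

```python
import numpy as np

# Numerical check: grad(mean of losses) equals mean of per-sample gradients,
# using central finite differences on a linear model with squared-error loss.

def loss(theta, u, y):
    return (theta @ u - y) ** 2

def risk(theta, U, Y):
    return np.mean([loss(theta, u, y) for u, y in zip(U, Y)])

def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
theta = rng.normal(size=2)
U, Y = rng.normal(size=(4, 2)), rng.normal(size=4)

lhs = num_grad(lambda t: risk(t, U, Y), theta)   # gradient of the average
rhs = np.mean([num_grad(lambda t: loss(t, u, y), theta)
               for u, y in zip(U, Y)], axis=0)   # average of the gradients
print(np.allclose(lhs, rhs, atol=1e-5))          # True
```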

References

  1. Jaeger, H. (2024, May 4). Neural Networks (AI) (WBAI028-05): Lecture notes, BSc program in Artificial Intelligence, University of Groningen. Retrieved from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf