Gradient Empirical Risk (sum of gradients)

Prerequisites

Empirical Risk of a Model | \(R^\text{emp}(h) = \frac{1}{N} \sum_{i=1}^{N} L(h(u_i), y_i)\)
Gradient Empirical Risk | \(\nabla R^\text{emp}(\mathcal{N}_{\theta^{(n)}})=\left(\frac{\partial R^\text{emp}}{\partial \theta_1}(\theta^{(n)}),\dots,\frac{\partial R^\text{emp}}{\partial \theta_{L}}(\theta^{(n)})\right)\)

Description

We can write the gradient of the empirical risk as a sum of gradients. This is also exactly how the gradient is computed in practice: by doing a sweep through the training set (an 'epoch'), we compute the gradient of the loss with respect to the parameters at each training sample and average these per-sample gradients, as in the sketch after the formula below.

\[\nabla R^\text{emp}(\mathcal{N}_{\theta})=\frac{1}{N}\sum_{i=1}^{N}\nabla L\left(\mathcal{N}_{\theta}(u_i), \mathbf{y}_i\right)\]
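As a concrete illustration of such a sweep, here is a minimal Python/NumPy sketch, assuming for simplicity a linear model \(\mathcal{N}_\theta(u) = \theta^\top u\) with squared-error loss \(L(\hat{y}, y) = (\hat{y} - y)^2\); the function and variable names are ours, not from the lecture notes:

```python
import numpy as np

# Minimal sketch: linear model N_theta(u) = theta . u with squared-error loss
# L(y_hat, y) = (y_hat - y)^2, whose gradient w.r.t. theta is 2*(y_hat - y)*u.

def empirical_risk_gradient(theta, U, Y):
    """Average the per-sample loss gradients over one sweep (epoch) of the data."""
    grad = np.zeros_like(theta)
    for u_i, y_i in zip(U, Y):                 # sweep through the training set
        y_hat = theta @ u_i                    # model output N_theta(u_i)
        grad += 2.0 * (y_hat - y_i) * u_i      # gradient of L(N_theta(u_i), y_i)
    return grad / len(U)                       # the 1/N factor from the definition

# Tiny example with three training samples in R^2
rng = np.random.default_rng(0)
theta = rng.normal(size=2)
U = rng.normal(size=(3, 2))
Y = rng.normal(size=3)
print(empirical_risk_gradient(theta, U, Y))
```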

Symbols Used:

\( \mathcal{N} \)

This is the symbol used for a function approximator, typically a neural network.

\( i \)

This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\( R \)

This symbol denotes the risk of a model.

\( \theta \)

This is the symbol we use for model weights/parameters.

\( \mathbf{y} \)

This symbol represents the output activation vector of a neural network.

\( L \)

This is the symbol for a loss function: a function that quantifies how far a model's prediction is from the target output.

\( \nabla \)

This symbol represents the gradient of a function.

\( u \)

This symbol denotes the input of a model.

Derivation

  1. Recall the definition of the empirical risk \(R^\text{emp}\) of a model \(h\) (here \(h = \mathcal{N}_\theta\)): \[R^\text{emp}(h) = \frac{1}{N} \sum_{i=1}^{N} L(h(u_i), y_i)\]
  2. Recall the definition of the gradient of the empirical risk (here the subscript \(L\) denotes the number of trainable parameters, not the loss): \[\nabla R^\text{emp}(\mathcal{N}_{\theta^{(n)}})=\left(\frac{\partial R^\text{emp}}{\partial \theta_1}(\theta^{(n)}),\dots,\frac{\partial R^\text{emp}}{\partial \theta_{L}}(\theta^{(n)})\right)\]
  3. We can plug in the definition of the empirical risk to obtain \[\nabla R^\text{emp}(\mathcal{N}_{\theta})=\nabla\left(\frac{1}{N}\sum_{i=1}^{N} L\left(\mathcal{N}_{\theta}(u_i), \mathbf{y}_i\right)\right).\]
  4. By the linearity of differentiation, the gradient of a sum is the sum of the gradients (the sum rule familiar from single-variable calculus, applied componentwise), and the constant factor \(\frac{1}{N}\) can be pulled out of the gradient. We obtain: \[\nabla R^\text{emp}(\mathcal{N}_{\theta})=\frac{1}{N}\sum_{i=1}^{N}\nabla L\left(\mathcal{N}_{\theta}(u_i), \mathbf{y}_i\right)\] as required. A quick numerical check of this identity is sketched below.
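The linearity used in step 4 is easy to verify numerically: differentiating the averaged loss gives the same vector as averaging the per-sample gradients. Below is a hedged sketch using central finite differences on the same toy linear model with squared-error loss; all names are illustrative assumptions:

```python
import numpy as np

# Numerical check: grad(mean of losses) equals mean of per-sample gradients,
# using central finite differences on a linear model with squared-error loss.

def loss(theta, u, y):
    return (theta @ u - y) ** 2

def risk(theta, U, Y):
    return np.mean([loss(theta, u, y) for u, y in zip(U, Y)])

def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
theta = rng.normal(size=2)
U, Y = rng.normal(size=(4, 2)), rng.normal(size=4)

lhs = num_grad(lambda t: risk(t, U, Y), theta)   # gradient of the average
rhs = np.mean([num_grad(lambda t: loss(t, u, y), theta)
               for u, y in zip(U, Y)], axis=0)   # average of the gradients
print(np.allclose(lhs, rhs, atol=1e-5))          # True
```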

References

  1. Jaeger, H. (2024, May 4). Neural Networks (AI) (WBAI028-05): Lecture notes, BSc program in Artificial Intelligence, University of Groningen. Retrieved from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf