Gradient Empirical Risk

Prerequisites

Empirical Risk of a Model | \(R^{emp}(h) = \frac{1}{N} \sum^{N}_{i=1} L (h(u_i), y_i)\)

Description

This is the gradient of the empirical risk of a model with respect to the model's parameters \( \htmlClass{sdt-0000000066}{\theta} \) at some timestep \( \htmlClass{sdt-0000000117}{n} \). Computing this gradient is a crucial step when training a machine learning model to minimize the empirical risk: it determines the direction in which the model's weights must move, and is therefore used to update the weights at each training step.

\[\htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R}^\text{emp}(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}})=\left(\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_1}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}),\dots,\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_{\htmlClass{sdt-0000000119}{L}}}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})})\right)\]
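As a concrete sketch of how this gradient drives a weight update (an illustrative setup not prescribed by this page: a linear model with squared-error loss, three made-up samples, and a hypothetical learning rate `eta`), the gradient can be approximated by finite differences and used for one training step:

```python
# Illustrative sketch: estimate grad R^emp by central finite differences
# and take one gradient-descent step. The model h_theta(u) = theta[0]*u + theta[1],
# the squared-error loss, and eta are assumed for demonstration only.

def empirical_risk(theta, data):
    """R^emp(theta) = (1/N) * sum of squared errors over the samples (u_i, y_i)."""
    return sum((theta[0] * u + theta[1] - y) ** 2 for u, y in data) / len(data)

def gradient(theta, data, eps=1e-6):
    """Central finite-difference approximation of the gradient at theta."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta); plus[i] += eps
        minus = list(theta); minus[i] -= eps
        grad.append((empirical_risk(plus, data) - empirical_risk(minus, data)) / (2 * eps))
    return grad

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # samples (u_i, y_i) from y = 2u + 1
theta = [0.0, 0.0]
eta = 0.1  # learning rate (hypothetical choice)
g = gradient(theta, data)
theta = [t - eta * gi for t, gi in zip(theta, g)]  # one gradient-descent step
```

After the step, the empirical risk of the updated parameters is lower than at the starting point, illustrating why the gradient is the central quantity in training.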

Symbols Used:

\( \mathcal{N} \)

This is the symbol used for a function approximator, typically a neural network.

\( R \)

This symbol denotes the risk of a model.

\( \theta \)

This is the symbol we use for model weights/parameters.

\( \nabla \)

This symbol represents the gradient of a function.

\( n \)

This symbol represents any given whole number, \( n \in \htmlClass{sdt-0000000014}{\mathbb{W}}\).

\( L \)

This symbol refers to the number of neurons in a layer; in the equation above, it gives the length of the parameter vector \( \theta \).

Derivation

  1. Consider an arbitrary function \(f: \htmlClass{sdt-0000000045}{\mathbb{R}}^{\htmlClass{sdt-0000000117}{n}}\to \htmlClass{sdt-0000000045}{\mathbb{R}}\). Recall that its gradient at the point \(p=(x_1,\dots,x_{\htmlClass{sdt-0000000117}{n}})\) is given by \[\htmlClass{sdt-0000000093}{\nabla} f(p)=\left(\frac{\partial f}{\partial x_1}(p),\dots,\frac{\partial f}{\partial x_{\htmlClass{sdt-0000000117}{n}}}(p)\right).\]
  2. Consider the definition of the empirical risk of a model:
    \[\htmlClass{sdt-0000000062}{R}^{emp}(\htmlClass{sdt-0000000084}{h}) = \frac{1}{N} \sum^{N}_{i=1} L (\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\]
  3. Now, \(\htmlClass{sdt-0000000062}{R}^\text{emp}\) plays the role of the arbitrary function above: it maps the model at time \(\htmlClass{sdt-0000000117}{n}\) to a (scalar) empirical risk. Recall that a model \(\htmlClass{sdt-0000000001}{\mathcal{N}}\) is characterized by its parameter vector \(\htmlClass{sdt-0000000066}{\theta}\), so we compute the empirical risk of \(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}}\).
  4. We want to compute the gradient of the empirical risk at the point \(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}\). Plugging the variables into the definition of the gradient gives: \[\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}^\text{emp}(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}})=\left(\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_1}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}),\dots,\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_{\htmlClass{sdt-0000000119}{L}}}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})})\right)\]

as required.
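The componentwise definition in step 4 can be checked numerically. In the sketch below (an assumed setup, not part of these notes: a linear model \(h_\theta(u) = \theta_1 u + \theta_2\) with squared loss on three made-up samples), the analytic partial derivatives agree with a central finite-difference approximation of each component:

```python
# Verify that the gradient, built component by component from partial
# derivatives as in step 4, matches a finite-difference estimate.
# Model, loss, and data are hypothetical choices for illustration.

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # samples (u_i, y_i)

def risk(theta):
    """R^emp(theta) with squared-error loss."""
    return sum((theta[0] * u + theta[1] - y) ** 2 for u, y in data) / len(data)

def analytic_grad(theta):
    """Partial derivatives: dR/d theta_1 = (2/N) sum e_i u_i, dR/d theta_2 = (2/N) sum e_i."""
    err = [theta[0] * u + theta[1] - y for u, y in data]
    n = len(data)
    return [2 * sum(e * u for e, (u, _) in zip(err, data)) / n,
            2 * sum(err) / n]

def numeric_grad(theta, eps=1e-6):
    """Central finite differences, one component at a time (as in step 4)."""
    g = []
    for i in range(len(theta)):
        hi = list(theta); hi[i] += eps
        lo = list(theta); lo[i] -= eps
        g.append((risk(hi) - risk(lo)) / (2 * eps))
    return g

theta = [0.5, -0.2]
```

Because the risk is quadratic in \(\theta\) here, the central-difference estimate matches the analytic gradient to high precision.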

References

  1. Jaeger, H. (2024, May 4). Neural Networks (AI) (WBAI028-05) Lecture Notes BSc program in Artificial Intelligence. Retrieved from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf