Loss Minimization with Regularization

Description

It is often desirable to find simpler models, for example ones whose weights are close to zero. To achieve this, a regularization function over the model parameters is added to the usual loss minimization problem (\( \alpha \) is a constant hyperparameter):

\[\htmlClass{sdt-0000000002}{\hat{f}} = \argmin_{\htmlClass{sdt-0000000084}{h} \in \htmlClass{sdt-0000000039}{\mathcal{H}}} \left[ \frac{1}{N} \sum_{i=1}^{N} \htmlClass{sdt-0000000072}{L}\left( \htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i \right) + \alpha^2 \htmlClass{sdt-0000000076}{\textup{reg}}(\htmlClass{sdt-0000000066}{\theta}_{\htmlClass{sdt-0000000084}{h}}) \right]\]
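
As a concrete illustration, the sketch below evaluates this objective in Python with NumPy. The linear model, squared-error loss, and L2 regularizer used here are assumptions chosen only to make the snippet runnable; they are not part of the definition above.

```python
import numpy as np

def l2_reg(theta):
    """Assumed L2 regularizer: sum of squared parameters."""
    return float(np.sum(np.asarray(theta) ** 2))

def regularized_objective(h, theta, inputs, targets, loss, reg, alpha_sq):
    """Mean loss over the samples plus the weighted regularization term,
    i.e. (1/N) * sum_i L(h(u_i), y_i) + alpha^2 * reg(theta)."""
    empirical_risk = np.mean([loss(h(u, theta), y) for u, y in zip(inputs, targets)])
    return empirical_risk + alpha_sq * reg(theta)

# Hypothetical model and loss, only for illustration.
def linear_model(u, theta):
    return float(np.dot(theta, u))

def squared_error(prediction, y):
    return (prediction - y) ** 2

# Example usage with made-up data:
inputs = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
targets = [3.0, 0.0]
theta = np.array([0.8, 1.1])
print(regularized_objective(linear_model, theta, inputs, targets,
                            squared_error, l2_reg, alpha_sq=0.25))
```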

Symbols Used:

\( \hat{f} \)

This symbol denotes the optimal model for a problem.

\( y \)

This symbol stands for the ground truth of a sample. In supervised learning this is often paired with the corresponding input.

\( \mathcal{H} \)

This is the symbol representing the set of possible models.

\( \theta \)

This is the symbol we use for model weights/parameters.

\( L \)

This is the symbol for a loss function. It is a function that quantifies how far a model's prediction is from where it should be.

\( \textup{reg} \)

This is the symbol used for representing a regularization function.

\( h \)

This symbol denotes a model in machine learning.

\( u \)

This symbol denotes the input of a model.

Example

The following example shows how this regularized formulation of the optimization target can favor models with simpler parameters (here, weights closer to zero); a short code check of the arithmetic follows the list:

  1. Consider a model \( \htmlClass{sdt-0000000084}{h}_1 \) with loss \( \htmlClass{sdt-0000000072}{L}_1 = 10 \) for the weights \( \htmlClass{sdt-0000000066}{\theta}_1 = (2, 4, 6) \).
  2. Consider another model \( \htmlClass{sdt-0000000084}{h}_2 \) with higher loss \( \htmlClass{sdt-0000000072}{L}_2 = 18 \) for the weights \( \htmlClass{sdt-0000000066}{\theta}_2 = (1, 2, 3) \).
  3. Consider the L2 regularizer, giving:
    \[ \htmlClass{sdt-0000000076}{\textup{reg}}(\htmlClass{sdt-0000000066}{\theta}_1) = 2^2 + 4^2 + 6^2 = 56 \\ \htmlClass{sdt-0000000076}{\textup{reg}}(\htmlClass{sdt-0000000066}{\theta}_2) = 1^2 + 2^2 + 3^2 = 14 \]
  4. Consider \( \alpha^2 = 0.25 \) as the hyperparameter controlling the effect of the regularization term. Then:
    \[ L_1 + \alpha^2 \htmlClass{sdt-0000000076}{\textup{reg}}(\htmlClass{sdt-0000000066}{\theta}_1) = 10 + 0.25 \cdot 56 = 10 + 14 = 24 \\ L_2 + \alpha^2 \htmlClass{sdt-0000000076}{\textup{reg}}(\htmlClass{sdt-0000000066}{\theta}_2) = 18 + 0.25 \cdot 14 = 18 + 3.5 = 21.5 \]
  5. The optimization process will therefore choose \( \htmlClass{sdt-0000000084}{h}_2 \) over \( \htmlClass{sdt-0000000084}{h}_1 \), since \( 21.5 < 24 \), even though the unregularized loss of \( \htmlClass{sdt-0000000084}{h}_1 \) is lower.
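
The arithmetic in steps 3 and 4 can be verified with a few lines of Python; the loss values, weight vectors, and \( \alpha^2 \) are taken directly from the example above.

```python
import numpy as np

def l2_reg(theta):
    # L2 regularizer: sum of squared weights.
    return float(np.sum(np.asarray(theta) ** 2))

alpha_sq = 0.25
loss_1, theta_1 = 10.0, (2.0, 4.0, 6.0)
loss_2, theta_2 = 18.0, (1.0, 2.0, 3.0)

obj_1 = loss_1 + alpha_sq * l2_reg(theta_1)  # 10 + 0.25 * 56 = 24.0
obj_2 = loss_2 + alpha_sq * l2_reg(theta_2)  # 18 + 0.25 * 14 = 21.5

print(obj_1, obj_2)   # 24.0 21.5
print(obj_2 < obj_1)  # True: h_2 wins under the regularized objective
```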
