Operationalization of Supervised Learning

Prerequisites

Empirical Risk of a Model | \(R^{emp}(h) = \frac{1}{N} \sum^{N}_{i=1} L (h(u_i), y_i)\)
Optimal Model | \(\hat{f} = h_{opt} = \underset{h \in \mathcal{H}}{argmin} \hspace{0.2cm} \frac{1}{N} \sum^{N}_{i=1} L (h(u_i), y_i)\)
Hypothesis Space | \( \mathcal{H} \)
Sample | \( S \)
Model | \( h \)

Description

The fundamental goal of supervised learning is to discover the optimal model \( \htmlClass{sdt-0000000002}{\hat{f}} \) that minimizes risk \( \htmlClass{sdt-0000000062}{R} \) when applied to unseen testing data drawn from the distributions of random variables \( \htmlClass{sdt-0000000013}{U} \) and \( \htmlClass{sdt-0000000021}{Y} \). However, the model's only source of knowledge is the training data, comprising \(N\) samples: \(\htmlClass{sdt-0000000057}{S} = (\htmlClass{sdt-0000000103}{u}_i, \htmlClass{sdt-0000000037}{y}_i)_{i=1,...,N} \).

Because we lack access to the testing data, we optimize the model using the training data, aiming to minimize its empirical risk ("training error"). The underlying hope is that by minimizing empirical risk, the model will generalize well to the unseen testing data.

\[\mathcal{A}(\htmlClass{sdt-0000000057}{S}) = \underset{h \in \htmlClass{sdt-0000000039}{\mathcal{H}}}{argmin} \hspace{0.2cm} \frac{1}{N} \sum^{N}_{i=1} L (\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\]
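To make this concrete, here is a minimal Python sketch of the empirical risk computation, assuming a squared-error loss for \(L\) and a toy linear model for \( \htmlClass{sdt-0000000084}{h} \); both choices are illustrative, not prescribed:

```python
def empirical_risk(h, S, L):
    """Average loss of model h over the training sample S = [(u_1, y_1), ..., (u_N, y_N)]."""
    return sum(L(h(u), y) for u, y in S) / len(S)

# Illustrative assumptions: squared-error loss and an arbitrary linear model.
squared_error = lambda prediction, target: (prediction - target) ** 2
h = lambda u: 2.0 * u + 1.0                      # a candidate model h
S = [(0.0, 1.2), (1.0, 2.9), (2.0, 5.1)]         # toy training pairs (u_i, y_i)

print(empirical_risk(h, S, squared_error))       # R^emp(h), the training error
```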

Symbols Used:

\( y \)

This symbol stands for the ground truth of a sample. In supervised learning, this is often paired with the corresponding input.

\( \mathcal{H} \)

This is the symbol representing the set of possible models.

\( S \)

This symbol describes the set of input and ground-truth pairs \((\htmlClass{sdt-0000000103}{u}_i, \htmlClass{sdt-0000000037}{y}_i)\) used to train a model.

\( h \)

This symbol denotes a model in machine learning.

\( u \)

This symbol denotes the input of a model.

Derivation

  1. Consider the definition of the training data \( \htmlClass{sdt-0000000057}{S} \):

    This symbol \(S\) describes the set of input and ground-truth pairs \((\htmlClass{sdt-0000000103}{u}_i, \htmlClass{sdt-0000000037}{y}_i)_{i=1,...,N}\) used to train a model, where \(N\) is the total number of data points. This set is also known as the training data. The risk calculated using these samples is known as the empirical risk.

  2. Now suppose we have an arbitrary algorithm \(\mathcal{A}\) which operates on the training data and obtains the optimal model \( \htmlClass{sdt-0000000002}{\hat{f}} \) by searching the hypothesis space \( \htmlClass{sdt-0000000039}{\mathcal{H}} \) over candidate models \( \htmlClass{sdt-0000000084}{h} \). It might be wise to recall the definition of \( \htmlClass{sdt-0000000084}{h} \):

    The symbol for a model is \(h\). It represents a machine learning model that takes an input and gives an output.


    and \( \htmlClass{sdt-0000000039}{\mathcal{H}} \):

    The symbol \( \mathcal{H} \) denotes the set of possible models, often from a particular class like "polynomials of any degree" or "multi-layer perceptron networks". For any learning algorithm, \( \mathcal{H} \) indicates the space where an optimal model may be found.

  3. We can then calculate the empirical risk of a model \( \htmlClass{sdt-0000000084}{h} \) on the training pairs using the formula below:
    \[\htmlClass{sdt-0000000062}{R}^{emp}(\htmlClass{sdt-0000000084}{h}) = \frac{1}{N} \sum^{N}_{i=1} L (\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\]
  4. Since we are interested in the best model \( \htmlClass{sdt-0000000002}{\hat{f}} \), we can search over the models \( \htmlClass{sdt-0000000084}{h} \) in the hypothesis space \( \htmlClass{sdt-0000000039}{\mathcal{H}} \) and select the one with the lowest empirical risk. This can be done using the below equation (a code sketch after this derivation illustrates the search):
    \[\htmlClass{sdt-0000000002}{\hat{f}} = h_{opt} = \underset{h \in \htmlClass{sdt-0000000039}{\mathcal{H}}}{argmin} \hspace{0.2cm} \frac{1}{N} \sum^{N}_{i=1} L (\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\]
  5. We observe that the above equation fulfils the definition of our algorithm \(\mathcal{A}\) described previously. Thus, we arrive at the equation:
    \[\mathcal{A}(\htmlClass{sdt-0000000057}{S}) = \underset{h \in \htmlClass{sdt-0000000039}{\mathcal{H}}}{argmin} \hspace{0.2cm} \frac{1}{N} \sum^{N}_{i=1} L (\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\]
    as required.
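
The derivation can be illustrated end to end with a short Python sketch. This is a toy construction, not a prescribed implementation: it builds a small finite hypothesis space \( \mathcal{H} \) of least-squares polynomials, uses squared error as an assumed stand-in for the loss \(L\), and implements \(\mathcal{A}(S)\) as the argmin of the empirical risk over \( \mathcal{H} \). In practice, \( \mathcal{H} \) is usually infinite and the search is carried out by an optimizer such as gradient descent rather than by enumeration.

```python
import numpy as np

def empirical_risk(h, S):
    """R^emp(h): mean squared error of model h over the sample S (assumed choice of L)."""
    return float(np.mean([(h(u) - y) ** 2 for u, y in S]))

def algorithm_A(S, H):
    """A(S): return the model in the hypothesis space H with the lowest empirical risk."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Toy hypothesis space: least-squares polynomials of a few fixed degrees.
S = [(0.0, 1.1), (1.0, 2.0), (2.0, 2.9), (3.0, 4.2)]
u_vals, y_vals = zip(*S)
H = [np.poly1d(np.polyfit(u_vals, y_vals, deg)) for deg in (0, 1, 2)]

f_hat = algorithm_A(S, H)                        # \hat{f} = h_opt for this sample
print(f_hat, empirical_risk(f_hat, S))
```

Note that, on the training data alone, the most flexible model in \( \mathcal{H} \) typically attains the lowest empirical risk; whether it also performs well on unseen data is exactly the hope expressed in the Description above.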
