Description
Like any Machine Learning model, a Multi-Layer Perceptron has an associated risk that we seek to minimize. Because, once again, the distributions of the inputs and outputs are generally unknown, the empirical risk over some dataset is minimized instead.
\[\htmlClass{sdt-0000000066}{\theta}_\text{opt} = \argmin_{\htmlClass{sdt-0000000066}{\theta} \in \htmlClass{sdt-0000000052}{\Theta}} \frac{1}{N} \sum_{i=1}^{N} \htmlClass{sdt-0000000072}{L}(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\]
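In code, the empirical risk is just the loss averaged over the training pairs \( (\htmlClass{sdt-0000000103}{u}_i, \htmlClass{sdt-0000000037}{y}_i) \). Below is a minimal sketch in JAX, assuming a toy two-layer perceptron and a squared-error loss; the layer sizes, variable names, and the choice of \( \htmlClass{sdt-0000000072}{L} \) are illustrative assumptions, not prescribed by the formulation above.

```python
import jax.numpy as jnp

def mlp(theta, u):
    # A toy two-layer perceptron N_theta; theta = (W1, b1, W2, b2) is assumed.
    W1, b1, W2, b2 = theta
    h = jnp.tanh(u @ W1 + b1)  # hidden layer with tanh activation
    return h @ W2 + b2         # linear readout

def empirical_risk(theta, U, Y):
    # (1/N) * sum_i L(N_theta(u_i), y_i), with L taken as squared error.
    preds = mlp(theta, U)
    return jnp.mean((preds - Y) ** 2)
```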
Derivation
- Consider the empirical risk of some model \(\htmlClass{sdt-0000000084}{h}\):
\[\htmlClass{sdt-0000000062}{R}^{\text{emp}}(\htmlClass{sdt-0000000084}{h}) = \frac{1}{N} \sum_{i=1}^{N} \htmlClass{sdt-0000000072}{L}(\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\] - Minimizing the empirical risk over \( \htmlClass{sdt-0000000084}{h} \) yields the optimal model \( \htmlClass{sdt-0000000002}{\hat{f}} \):
\[\htmlClass{sdt-0000000002}{\hat{f}} = \htmlClass{sdt-0000000084}{h}_\text{opt} = \argmin_{\htmlClass{sdt-0000000084}{h} \in \htmlClass{sdt-0000000039}{\mathcal{H}}} \frac{1}{N} \sum_{i=1}^{N} \htmlClass{sdt-0000000072}{L}(\htmlClass{sdt-0000000084}{h}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i)\] - Now consider a model in the form of a neural network \( \htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}} \), parametrized by its weights (parameters) \( \htmlClass{sdt-0000000066}{\theta} \).
- Finding the optimal model corresponds to finding the optimal weights \( \htmlClass{sdt-0000000066}{\theta}_\text{opt} \).
- Replacing \( \htmlClass{sdt-0000000084}{h} \) with \( \htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}} \), the hypothesis space \( \htmlClass{sdt-0000000039}{\mathcal{H}} \) with the parameter space \( \htmlClass{sdt-0000000052}{\Theta} \), and the optimization target \( \htmlClass{sdt-0000000002}{\hat{f}} \) with the optimal weights \( \htmlClass{sdt-0000000066}{\theta}_\text{opt} \), we get:
\[ \htmlClass{sdt-0000000066}{\theta}_\text{opt} = \argmin_{\htmlClass{sdt-0000000066}{\theta} \in \htmlClass{sdt-0000000052}{\Theta}} \frac{1}{N} \sum_{i=1}^{N} \htmlClass{sdt-0000000072}{L}(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}}(\htmlClass{sdt-0000000103}{u}_i), \htmlClass{sdt-0000000037}{y}_i) \]
as required.
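Continuing the sketch above, \( \htmlClass{sdt-0000000066}{\theta}_\text{opt} \) is typically approximated by gradient descent on the empirical risk rather than computed in closed form. The initialization, step size, iteration count, and toy data below are arbitrary assumptions; `mlp` and `empirical_risk` are the hypothetical functions defined earlier.

```python
import jax
import jax.numpy as jnp

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
# Hypothetical sizes: 3 inputs, 8 hidden units, 1 output.
theta = (0.1 * jax.random.normal(k1, (3, 8)), jnp.zeros(8),
         0.1 * jax.random.normal(k2, (8, 1)), jnp.zeros(1))

U = jax.random.normal(k3, (16, 3))  # N = 16 toy inputs u_i
Y = jnp.zeros((16, 1))              # corresponding targets y_i

for _ in range(100):
    # One gradient-descent step toward theta_opt = argmin of the empirical risk.
    grads = jax.grad(empirical_risk)(theta, U, Y)
    theta = tuple(p - 1e-2 * g for p, g in zip(theta, grads))
```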
Note: Other terms, such as a regularization penalty, can be added to the formulation in similar ways: see Loss Minimization with Regularization.
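As a rough illustration of how such a term enters, an L2 weight penalty can simply be added to the empirical risk before minimizing; the penalty form and the coefficient `lam` below are assumptions, and the linked page gives the proper formulation. This builds on the sketch above.

```python
def regularized_risk(theta, U, Y, lam=1e-3):
    # Empirical risk plus an (assumed) L2 penalty summed over all parameters.
    penalty = sum(jnp.sum(p ** 2) for p in theta)
    return empirical_risk(theta, U, Y) + lam * penalty
```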