Update Rule of Gradient Descent

Prerequisites

Gradient Empirical Risk | \(\nabla R^\text{emp}(\mathcal{N}_{\theta^{(n)}})=\left(\frac{\partial R^\text{emp}}{\partial \theta_1}(\theta^{(n)}),\dots,\frac{\partial R^\text{emp}}{\partial \theta_{L}}(\theta^{(n)})\right)\)

Description

This equation describes a simple gradient descent update. It uses the gradient of the risk, \( \htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}) \), to adjust the parameters \(\htmlClass{sdt-0000000083}{\theta}\) of the model \( \htmlClass{sdt-0000000084}{h} \).

\[\htmlClass{sdt-0000000083}{\theta} \leftarrow \htmlClass{sdt-0000000083}{\theta} - \htmlClass{sdt-0000000106}{\mu} \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R} (\htmlClass{sdt-0000000083}{\theta})\]
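For concreteness, here is a minimal NumPy sketch of a single update step; the function `grad_R`, assumed to return \(\nabla R(\theta)\) for the current parameters, and the learning rate `mu` are illustrative placeholders.

```python
import numpy as np

def gradient_descent_step(theta: np.ndarray, grad_R, mu: float) -> np.ndarray:
    """Perform one update: theta <- theta - mu * grad_R(theta).

    grad_R is assumed to return the gradient of the risk at theta,
    with the same shape as theta.
    """
    return theta - mu * grad_R(theta)
```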

Symbols Used:

\( R \)

This symbol denotes the risk of a model.

\( \theta \)

This symbol represents the parameters of the model.

\( \nabla \)

This symbol represents the gradient of a function.

\( \mu \)

This is the symbol representing the learning rate.

Derivation

Given a model parametrization \(\htmlClass{sdt-0000000083}{\theta} \in \htmlClass{sdt-0000000045}{\mathbb{R}}^D\), we can calculate the gradient of the risk in the same way as in Gradient Empirical Risk:

\[\htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R}^\text{emp}(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}})=\left(\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_1}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}),\dots,\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_{\htmlClass{sdt-0000000119}{L}}}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})})\right)\]

to obtain \(\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}) \in \htmlClass{sdt-0000000045}{\mathbb{R}}^D\) (the number of parameters \(\htmlClass{sdt-0000000119}{L}\) in the formula above equals \(D\) here). Intuitively, this vector points in the direction in which the risk grows fastest.
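Since the entries of this vector are ordinary partial derivatives, one illustrative (though inefficient) way to approximate it is by central finite differences; the risk function `R_emp` and the perturbation size `eps` below are assumptions made for the sake of the sketch, with \(\theta\) stored as a flat vector of length \(D\).

```python
import numpy as np

def numerical_gradient(R_emp, theta: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Approximate each partial derivative dR_emp/d(theta_i) by a central difference."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = eps
        grad[i] = (R_emp(theta + e_i) - R_emp(theta - e_i)) / (2 * eps)
    return grad
```

In practice the gradient of a neural network's empirical risk is computed analytically via backpropagation rather than by finite differences; the sketch only illustrates what the vector of partial derivatives means.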

Naturally, we want to minimize the risk, so we shift the parameters in the opposite direction, i.e. the direction of steepest descent. This means multiplying the gradient by \(-1\), obtaining \[-\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}).\]

Further, we want to shift the parameters by only a small amount; otherwise, training may become unstable and never converge. This step size is controlled by the learning rate \( \htmlClass{sdt-0000000106}{\mu} \): \[-\htmlClass{sdt-0000000106}{\mu}\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}).\]
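One standard way to see why the step should be small (a first-order argument, stated here as an aside) is a Taylor expansion of the risk around \(\theta\):

\[R(\theta - \mu \nabla R(\theta)) \approx R(\theta) - \mu \left\lVert \nabla R(\theta) \right\rVert^2 \le R(\theta),\]

which shows that for sufficiently small \(\mu\) the update does not increase the risk to first order, while for large \(\mu\) the approximation, and with it this guarantee, breaks down.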

Having defined the step by which we want to move the parameters, we can finally update them:

\[\htmlClass{sdt-0000000083}{\theta} \leftarrow \htmlClass{sdt-0000000083}{\theta} - \htmlClass{sdt-0000000106}{\mu} \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R} (\htmlClass{sdt-0000000083}{\theta})\]
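Putting the pieces together, a minimal gradient descent loop might look as follows; the quadratic risk, its gradient, the initial parameters, the learning rate, and the number of iterations are all illustrative choices rather than values prescribed above.

```python
import numpy as np

# Illustrative risk: R(theta) = ||theta||^2, with gradient 2 * theta.
def R(theta: np.ndarray) -> float:
    return float(np.sum(theta ** 2))

def grad_R(theta: np.ndarray) -> np.ndarray:
    return 2.0 * theta

theta = np.array([3.0, -2.0])  # initial parameters
mu = 0.1                       # learning rate

for _ in range(50):
    theta = theta - mu * grad_R(theta)  # theta <- theta - mu * grad R(theta)

print(theta, R(theta))  # theta approaches the minimizer (0, 0); R(theta) approaches 0
```

With a much larger learning rate (for instance \(\mu = 1.5\) in this example), the same loop diverges, illustrating the instability mentioned in the derivation.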

References

  1. Jaeger, H. (2024). Neural Networks (AI) (WBAI028-05) Lecture Notes BSc program in Artificial Intelligence. Retrieved April 14, 2024, from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf