Update Rule of Gradient Descent

Prerequisites

Gradient Empirical Risk | \(\nabla R^\text{emp}(\mathcal{N}_{\theta^{(n)}})=\left(\frac{\partial R^\text{emp}}{\partial \theta_1}(\theta^{(n)}),\dots,\frac{\partial R^\text{emp}}{\partial \theta_{L}}(\theta^{(n)})\right)\)

Description

This equation describes a simple gradient descent update. It uses the gradient of the risk, \( \htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}) \), to adjust the parameters \(\htmlClass{sdt-0000000083}{\theta}\) of the model \( \htmlClass{sdt-0000000084}{h} \).

\[\htmlClass{sdt-0000000083}{\theta} \leftarrow \htmlClass{sdt-0000000083}{\theta} - \htmlClass{sdt-0000000106}{\mu} \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R} (\htmlClass{sdt-0000000083}{\theta})\]
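For concreteness, here is a minimal NumPy sketch of a single update step; the function `grad_R`, assumed to return \(\nabla R(\theta)\) for the current parameters, and the learning rate `mu` are illustrative placeholders.

```python
import numpy as np

def gradient_descent_step(theta: np.ndarray, grad_R, mu: float) -> np.ndarray:
    """Perform one update: theta <- theta - mu * grad_R(theta).

    grad_R is assumed to return the gradient of the risk at theta,
    with the same shape as theta.
    """
    return theta - mu * grad_R(theta)
```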

Symbols Used:

\( R \)

This symbol denotes the risk of a model.

\( \theta \)

This symbol represents the parameters of the model.

\( \nabla \)

This symbol represents the gradient of a function.

\( \mu \)

This is the symbol representing the learning rate.

Derivation

Given a model parametrization \(\htmlClass{sdt-0000000083}{\theta} \in \htmlClass{sdt-0000000045}{\mathbb{R}}^D\), we can calculate the gradient of the risk in the same way as in Gradient Empirical Risk:

\[\htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R}^\text{emp}(\htmlClass{sdt-0000000001}{\mathcal{N}}_{\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}})=\left(\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_1}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})}),\dots,\frac{\partial \htmlClass{sdt-0000000062}{R}^\text{emp}}{\partial \htmlClass{sdt-0000000066}{\theta}_{\htmlClass{sdt-0000000119}{L}}}(\htmlClass{sdt-0000000066}{\theta}^{(\htmlClass{sdt-0000000117}{n})})\right)\]

to obtain \(\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}) \in \htmlClass{sdt-0000000045}{\mathbb{R}}^D\) (the number of parameters \(\htmlClass{sdt-0000000119}{L}\) in the formula above equals \(D\) here). Intuitively, this vector points in the direction in which the risk grows fastest.
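Since the entries of this vector are ordinary partial derivatives, one illustrative (though inefficient) way to approximate it is by central finite differences; the risk function `R_emp` and the perturbation size `eps` below are assumptions made for the sake of the sketch, with \(\theta\) stored as a flat vector of length \(D\).

```python
import numpy as np

def numerical_gradient(R_emp, theta: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Approximate each partial derivative dR_emp/d(theta_i) by a central difference."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = eps
        grad[i] = (R_emp(theta + e_i) - R_emp(theta - e_i)) / (2 * eps)
    return grad
```

In practice the gradient of a neural network's empirical risk is computed analytically via backpropagation rather than by finite differences; the sketch only illustrates what the vector of partial derivatives means.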

Naturally, we want to minimize the risk, so we shift the parameters in the opposite direction, i.e. the direction of steepest descent. This means multiplying the gradient by \(-1\), obtaining \[-\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}).\]

Further, we want to shift the parameters by only a small amount; otherwise, training may become unstable and never converge. This step size is controlled by the learning rate \( \htmlClass{sdt-0000000106}{\mu} \): \[-\htmlClass{sdt-0000000106}{\mu}\htmlClass{sdt-0000000093}{\nabla}\htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}).\]
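One standard way to see why the step should be small (a first-order argument, stated here as an aside) is a Taylor expansion of the risk around \(\theta\):

\[R(\theta - \mu \nabla R(\theta)) \approx R(\theta) - \mu \left\lVert \nabla R(\theta) \right\rVert^2 \le R(\theta),\]

which shows that for sufficiently small \(\mu\) the update does not increase the risk to first order, while for large \(\mu\) the approximation, and with it this guarantee, breaks down.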

Having defined the step by which we want to move the parameters, we can finally update them:

\[\htmlClass{sdt-0000000083}{\theta} \leftarrow \htmlClass{sdt-0000000083}{\theta} - \htmlClass{sdt-0000000106}{\mu} \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R} (\htmlClass{sdt-0000000083}{\theta})\]
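Putting the pieces together, a minimal gradient descent loop might look as follows; the quadratic risk, its gradient, the initial parameters, the learning rate, and the number of iterations are all illustrative choices rather than values prescribed above.

```python
import numpy as np

# Illustrative risk: R(theta) = ||theta||^2, with gradient 2 * theta.
def R(theta: np.ndarray) -> float:
    return float(np.sum(theta ** 2))

def grad_R(theta: np.ndarray) -> np.ndarray:
    return 2.0 * theta

theta = np.array([3.0, -2.0])  # initial parameters
mu = 0.1                       # learning rate

for _ in range(50):
    theta = theta - mu * grad_R(theta)  # theta <- theta - mu * grad R(theta)

print(theta, R(theta))  # theta approaches the minimizer (0, 0); R(theta) approaches 0
```

With a much larger learning rate (for instance \(\mu = 1.5\) in this example), the same loop diverges, illustrating the instability mentioned in the derivation.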

References

  1. Jaeger, H. (2024). Neural Networks (AI) (WBAI028-05) Lecture Notes BSc program in Artificial Intelligence. Retrieved April 14, 2024, from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf