Weight Update Rule for Boltzmann Machines

Prerequisites

Update rule of the Gradient Descent | \(\theta \leftarrow \theta - \mu \nabla R (\theta)\)
Gradients of KL Divergence with Respect to Weights | \(\frac{\partial\, \mathrm{KL}(P_{\text{target}}(\mathbf{s}),\, P_{\mathbf{W}}(\mathbf{s}))}{\partial w_{ij}} = - \frac{1}{T}\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\)

Description

The equation is used to update the weights in a Boltzmann Machine during training. Its purpose is to minimize the Kullback-Leibler divergence between the target probability distribution \(P_{\text{target}}\) and the model's distribution \(P_{\mathbf{W}}\). This gradient descent-based rule adjusts the weights iteratively, aiming to improve the model's accuracy by reducing the discrepancy between the expected and actual joint activations of units \(i\) and \(j\).

\[w_{ij}(n + 1) = w_{ij}(n) + \mu\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\]
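
To make the rule concrete, here is a minimal Python sketch (not taken from the lecture notes) that applies the update to every weight of a small network at once. The variables corr_target and corr_model are assumed stand-ins for the matrices of average joint activations \(\langle s_i s_j \rangle\) under the target and model distributions, and all numbers are made up for illustration.

```python
import numpy as np

# Minimal sketch: one update of the full weight matrix of a tiny Boltzmann
# Machine. corr_target and corr_model stand for the matrices of average
# joint activations <s_i s_j> under the target distribution and under the
# model distribution P_W; the values are made up purely for illustration.
mu = 0.1                                   # learning rate
W = np.zeros((3, 3))                       # weights w_ij at iteration n
corr_target = np.array([[0.0, 0.7, 0.2],
                        [0.7, 0.0, 0.5],
                        [0.2, 0.5, 0.0]])
corr_model = np.array([[0.0, 0.4, 0.1],
                       [0.4, 0.0, 0.3],
                       [0.1, 0.3, 0.0]])

# w_ij(n + 1) = w_ij(n) + mu * (<s_i s_j>_target - <s_i s_j>_W)
W_next = W + mu * (corr_target - corr_model)
```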

Symbols Used:

\( j \)

This is a secondary symbol for an iterator, a variable that changes value to refer to a series of elements.

\( i \)

This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\( w \)

This symbol describes the connection strength between two units in a Boltzmann machine.

\( \mu \)

This is the symbol representing the learning rate.

\( n \)

This symbol represents any given whole number, \( n \in \mathbb{W} \).

Derivation

Let us begin by considering the rule for gradient descent:

\[\theta \leftarrow \theta - \mu \nabla R (\theta)\]
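
As a quick aside, one step of this generic rule can be sketched in Python on a made-up quadratic risk; the choice \(R(\theta) = \theta^2\) and the numbers below are illustrative assumptions, not part of the Boltzmann Machine setting.

```python
# One step of the generic rule theta <- theta - mu * grad R(theta),
# applied to a toy risk R(theta) = theta**2 (chosen only for illustration).
theta = 2.0
mu = 0.1
grad_R = 2.0 * theta            # gradient of R at the current theta
theta = theta - mu * grad_R     # gradient descent step
print(theta)                    # 1.6
```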

In the notation we are using for the Boltzmann Machine weights, the gradient descent rule is equivalent to:

\[w_{ij}(n + 1) = w_{ij}(n) - \mu \nabla R(\mathbf{W})\]

as the model parameters \( \theta \) correspond to the weights \( w_{ij}(n) \) at iteration \( n \).

For the risk \( R \) we use the Kullback-Leibler divergence, whose gradient with respect to a single weight is:

\[\frac{\partial\, \mathrm{KL}(P_{\text{target}}(\mathbf{s}),\, P_{\mathbf{W}}(\mathbf{s}))}{\partial w_{ij}} = - \frac{1}{T}\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\]
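
As a numerical aside, once the two average joint activations are known this gradient can be evaluated directly. The sketch below uses made-up values for the temperature and the two averages.

```python
# Evaluate the KL gradient for a single weight from the two averages.
# T and the two averages are made-up example values.
T = 1.0              # temperature
corr_target = 0.7    # <s_i s_j> under the target distribution
corr_model = 0.4     # <s_i s_j> under the model distribution P_W
grad = -(1.0 / T) * (corr_target - corr_model)
print(grad)          # approximately -0.3
```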

Using our gradient descent rule with the gradient of the Kullback-Leibler loss function, we get:

\[w_{ij}(n + 1) = w_{ij}(n) - \mu \left(\frac{\partial\, \mathrm{KL}(P_{\text{target}}(\mathbf{s}),\, P_{\mathbf{W}}(\mathbf{s}))}{\partial w_{ij}}\right)\]

By substituting in the right hand side of the equation for the gradient of our Kullback-Leibler loss, we get:

\[w_{ij}(n + 1) = w_{ij}(n) - \mu\left(- \frac{1}{T}\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\right)\]

We can now simplify the double negative:

\[w_{ij}(n + 1) = w_{ij}(n) + \mu\left(\frac{1}{T}\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\right)\]

Finally, we can absorb the \(\frac{1}{T}\) term into the learning rate, replacing \( \mu \) with \( \mu \cdot \frac{1}{T} \) and writing the rescaled rate simply as \( \mu \). This gives us:

\[w_{ij}(n + 1) = w_{ij}(n) + \mu\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\]

as required.
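
A small numerical check (with made-up values for all quantities) confirms that folding \(\frac{1}{T}\) into the learning rate leaves the update unchanged.

```python
# Numerical check that absorbing 1/T into the learning rate leaves the
# weight update unchanged. All values are made-up example numbers.
T = 2.0
mu = 0.2
w = 0.5
corr_target, corr_model = 0.7, 0.4

# Update with the explicit 1/T factor from the KL gradient.
w_explicit = w + mu * (1.0 / T) * (corr_target - corr_model)

# Equivalent update with the factor folded into the learning rate.
mu_eff = mu / T
w_folded = w + mu_eff * (corr_target - corr_model)

assert abs(w_explicit - w_folded) < 1e-12   # both give 0.53
```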

Example

Let us now work through an example for a single weight update, using the equation:

\[w_{ij}(n + 1) = w_{ij}(n) + \mu\left(\langle s_i s_j \rangle_{\text{target}} - \langle s_i s_j \rangle_{\mathbf{W}}\right)\]

We will say that:

\( w_{ij}(n) = 0.5 \)

\( \mu = 0.1 \)

\( \langle s_i s_j \rangle_{\text{target}} = 0.7 \)

\( \langle s_i s_j \rangle_{\mathbf{W}} = 0.4 \)

Substituting these values in, we find:

\[\begin{aligned}w_{ij}(n + 1) &= 0.5 + 0.1(0.7 - 0.4)\\&= 0.5 + 0.1(0.3)\\&= 0.5 + 0.03\\&= 0.53\end{aligned}\]

So \(0.53\) is our answer.
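
The arithmetic can be reproduced with a one-line Python check:

```python
# Reproduce the worked example: 0.5 + 0.1 * (0.7 - 0.4) = 0.53
print(0.5 + 0.1 * (0.7 - 0.4))   # 0.53 (up to floating-point rounding)
```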

References

  1. Jaeger, H. (n.d.). Neural Networks (AI) (WBAI028-05) Lecture Notes BSc program in Artificial Intelligence. Retrieved April 27, 2024, from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf