The equation is used to update the weights of a Boltzmann machine during training. Its purpose is to minimize the Kullback-Leibler divergence between the target probability distribution \(\htmlClass{sdt-0000000131}{P}_{\text{target}}\) and the model's distribution \(\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}\). This gradient-descent-based rule adjusts the weights iteratively, improving the model by reducing the discrepancy between the expected joint activations of units \(\htmlClass{sdt-0000000018}{i}\) and \(\htmlClass{sdt-0000000011}{j}\) under the target distribution and under the model's distribution.
\( P \) | This symbol denotes a probability distribution; here \( P_{\text{target}} \) is the target distribution over network states and \( P_{\htmlClass{sdt-0000000059}{\mathbf{W}}} \) is the distribution realised by the model with weights \( \htmlClass{sdt-0000000059}{\mathbf{W}} \).
\( j \) | This is a secondary symbol for an iterator, a variable that changes value to refer to a series of elements.
\( i \) | This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements. |
\( w \) | This symbol describes the connection strength between two units in a Boltzmann machine.
\( \mu \) | This is the symbol representing the learning rate. |
\( n \) | This symbol represents any given whole number, \( n \in \htmlClass{sdt-0000000014}{\mathbb{W}}\). |
Let us begin by considering the rule for gradient descent:
\[\htmlClass{sdt-0000000083}{\theta} \leftarrow \htmlClass{sdt-0000000083}{\theta} - \htmlClass{sdt-0000000106}{\mu} \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R} (\htmlClass{sdt-0000000083}{\theta})\]
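To make the generic rule concrete, here is a minimal Python sketch of a single gradient descent step; the names `theta`, `grad`, and `mu` are hypothetical placeholders that mirror \( \htmlClass{sdt-0000000083}{\theta} \), \( \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000083}{\theta}) \), and \( \htmlClass{sdt-0000000106}{\mu} \).

```python
def gradient_descent_step(theta, grad, mu):
    """Apply one gradient descent update: theta <- theta - mu * grad.

    theta: list of current parameter values
    grad:  list containing the gradient of the risk R evaluated at theta
    mu:    learning rate
    """
    return [t - mu * g for t, g in zip(theta, grad)]
```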
In the notation we are using for this equation, this is equivalent to:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) = \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) - \htmlClass{sdt-0000000106}{\mu} \htmlClass{sdt-0000000093}{\nabla} \htmlClass{sdt-0000000062}{R}(\htmlClass{sdt-0000000059}{\mathbf{W}})\]
as the model parameters (\( \htmlClass{sdt-0000000083}{\theta} \)) are equivalent to a weight at some iteration (\(\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n})\)), and the risk is evaluated at the weight matrix \( \htmlClass{sdt-0000000059}{\mathbf{W}} \).
For the gradient of our risk (\( \htmlClass{sdt-0000000062}{R} \)), we can use the derivative of the Kullback-Leibler loss function:
\[\frac{\partial\, KL(\htmlClass{sdt-0000000131}{P}_{\text{target}}(\htmlClass{sdt-0000000091}{\mathbf{s}}),\, \htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\partial \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = - \frac{1}{\htmlClass{sdt-0000000029}{T}}\left(\langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\text{target}}} - \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\]
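As an illustrative sketch (not part of the derivation), the right-hand side can be computed directly once the two joint-activation averages have been estimated, for example by sampling with the data clamped and with the network running freely. The function and argument names below are hypothetical.

```python
def kl_gradient_wrt_weight(avg_target, avg_model, T):
    """Gradient of the KL divergence with respect to a single weight w_ij:
    -(1/T) * (<s_i s_j>_target - <s_i s_j>_W).

    avg_target: estimated average of s_i * s_j under the target distribution
    avg_model:  estimated average of s_i * s_j under the model's distribution
    T:          temperature of the Boltzmann machine
    """
    return -(avg_target - avg_model) / T
```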
Using our gradient descent rule with the gradient of the Kullback-Leibler loss function, we get:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) = \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) - \htmlClass{sdt-0000000106}{\mu} \left(\frac{\partial\, KL(\htmlClass{sdt-0000000131}{P}_{\text{target}}(\htmlClass{sdt-0000000091}{\mathbf{s}}),\, \htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\partial \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}}\right)\]
By substituting in the right hand side of the equation for the gradient of our Kullback-Leibler loss, we get:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) = \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) - \htmlClass{sdt-0000000106}{\mu}\left(- \frac{1}{\htmlClass{sdt-0000000029}{T}}\left(\langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\text{target}}} - \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\right)\]
We can now simplify the double negative:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) = \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) + \htmlClass{sdt-0000000106}{\mu} \cdot \frac{1}{\htmlClass{sdt-0000000029}{T}}\left(\langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\text{target}}} - \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\]
Finally, we can absorb the \(\frac{1}{\htmlClass{sdt-0000000029}{T}}\) term into the learning rate (that is, replace the learning rate \( \htmlClass{sdt-0000000106}{\mu} \) by \(\htmlClass{sdt-0000000106}{\mu} \cdot \frac{1}{\htmlClass{sdt-0000000029}{T}}\)). This gives us:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) = \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) + \htmlClass{sdt-0000000106}{\mu}\left(\langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\text{target}}} - \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\]
as required.
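The final rule can be applied to all weights at once. The following is a minimal sketch assuming the pairwise activation averages have already been estimated and collected into matrices (the names `corr_target` and `corr_model` are hypothetical); it shows only the update step, not a full Boltzmann machine implementation.

```python
import numpy as np

def boltzmann_weight_update(W, corr_target, corr_model, mu):
    """One weight update step, applied element-wise to the weight matrix:
    W(n+1) = W(n) + mu * (<s_i s_j>_target - <s_i s_j>_W).

    W:           current weight matrix, shape (N, N)
    corr_target: matrix of estimated joint activations under the target
                 (clamped) distribution
    corr_model:  matrix of estimated joint activations under the model's
                 (free-running) distribution
    mu:          learning rate (with the 1/T factor already absorbed)
    """
    return W + mu * (corr_target - corr_model)
```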
Let us now work through an example for a single weight update, using the equation:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) = \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) + \htmlClass{sdt-0000000106}{\mu}\left(\langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\text{target}}} - \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\]
We will say that:
\[\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n}) = 0.5, \qquad \htmlClass{sdt-0000000106}{\mu} = 0.1, \qquad \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\text{target}}} = 0.7, \qquad \langle s_{\htmlClass{sdt-0000000018}{i}} s_{\htmlClass{sdt-0000000011}{j}} \rangle_{\htmlClass{sdt-0000000131}{P}_{\htmlClass{sdt-0000000059}{\mathbf{W}}}} = 0.4\]
Substituting these values in, we find:
\[\begin{align*}\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}(\htmlClass{sdt-0000000117}{n} + 1) &= 0.5 + 0.1(0.7 - 0.4)\\&= 0.5 + 0.1(0.3)\\&= 0.5 + 0.03\\&= 0.53\end{align*}\]
So \(0.53\) is our answer.
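The same arithmetic can be checked with a couple of lines of Python, using the example values assumed above.

```python
w_ij, mu = 0.5, 0.1                 # current weight and learning rate
avg_target, avg_model = 0.7, 0.4    # joint-activation averages (target vs. model)

w_ij_next = w_ij + mu * (avg_target - avg_model)
print(w_ij_next)                    # ~0.53 (up to floating-point rounding)
```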