Gradients of KL Divergence with Respect to Weights

Prerequisites

Energy of a Specific State in a Boltzmann Machine | \(E(\mathbf{s}) = -\sum_{i < j}w_{i j}\mathbf{s}_{i} \mathbf{s}_{j}\)
Boltzmann Distribution of Microstates | \(p(\mathbf{s}) = \frac{1}{Z} \exp\left\{ - \frac{ E(\mathbf{s}) }{ T } \right\}\)
Kullback-Leibler Divergence | \(KL(P, \hat{P}) = \sum_{\mathbf{s} \in S} P(\mathbf{s}) \log \frac{P(\mathbf{s})}{\hat{P}(\mathbf{s})}\)
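For a small network these prerequisite quantities can be evaluated by brute force. The sketch below (function names are illustrative, not from the source) enumerates all states of a 3-unit machine with \(\pm 1\) units and computes the energy, the Boltzmann distribution, and the KL divergence between two such distributions:

```python
import itertools
import math

def energy(s, w):
    """E(s) = -sum_{i<j} w[i][j] * s[i] * s[j]."""
    n = len(s)
    return -sum(w[i][j] * s[i] * s[j]
                for i in range(n) for j in range(i + 1, n))

def boltzmann(w, T, n):
    """Return {state: p(state)} with p(s) = exp(-E(s)/T) / Z."""
    states = list(itertools.product([-1, 1], repeat=n))
    unnorm = {s: math.exp(-energy(s, w) / T) for s in states}
    Z = sum(unnorm.values())  # partition function
    return {s: u / Z for s, u in unnorm.items()}

def kl(p, q):
    """KL(p, q) = sum_s p(s) * log(p(s) / q(s))."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p)
```

Note that enumerating all \(2^n\) states is only feasible for small \(n\); in practice the sums are approximated by sampling.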

Description

This equation gives the gradient, with respect to the weights of a Boltzmann machine, of the KL divergence between a target distribution and the distribution realized by the machine. It can be used to train the machine by gradient descent.

\[\frac{\delta KL(P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}),P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = - \frac{1}{\htmlClass{sdt-0000000029}{T}}\left(\langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{target}} - \langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\]
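Assuming both distributions can be enumerated exactly (feasible only for small networks), the right-hand side is a difference of two correlation averages scaled by \(-1/T\). A minimal sketch, with illustrative function names:

```python
import itertools
import math

def boltzmann(w, T, n):
    """p(s) = exp(-E(s)/T)/Z with E(s) = -sum_{i<j} w[i][j] s_i s_j."""
    states = list(itertools.product([-1, 1], repeat=n))
    e = {s: -sum(w[a][b] * s[a] * s[b]
                 for a in range(n) for b in range(a + 1, n)) for s in states}
    Z = sum(math.exp(-e[s] / T) for s in states)
    return {s: math.exp(-e[s] / T) / Z for s in states}

def kl_gradient(p_target, p_model, T, i, j):
    """dKL/dw_ij = -(1/T) * (<s_i s_j>_target - <s_i s_j>_model)."""
    avg_target = sum(p * s[i] * s[j] for s, p in p_target.items())
    avg_model = sum(p * s[i] * s[j] for s, p in p_model.items())
    return -(avg_target - avg_model) / T
```

When the model distribution equals the target distribution, the two averages coincide and the gradient vanishes, as expected at the minimum of the KL divergence.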

Symbols Used:

\( Z \)

This symbol represents the partition function, the normalization constant that makes the Boltzmann distribution sum to one.

\( j \)

This is a secondary symbol for an iterator, a variable that changes value to refer to a series of elements.

\( i \)

This is the symbol for an iterator, a variable that changes value to refer to a sequence of elements.

\( T \)

This symbol represents the temperature in a system.

\( \mathbf{W} \)

This symbol represents the matrix containing the connection weights of the network.

\( \mathbf{s} \)

This symbol represents the state of the system: a full description of all its units, analogous to a microstate in statistical physics.

\( w \)

This symbol describes the connection strength between two units in a Boltzmann machine.

Derivation

  1. Consider the definition of the Kullback-Leibler divergence:
    \[KL(P, \hat{P}) = \htmlClass{sdt-0000000080}{\sum}_{\htmlClass{sdt-0000000091}{\mathbf{s}} \in \htmlClass{sdt-0000000026}{S}} P(\htmlClass{sdt-0000000091}{\mathbf{s}}) \log \frac{P(\htmlClass{sdt-0000000091}{\mathbf{s}})}{\hat{P}(\htmlClass{sdt-0000000091}{\mathbf{s}})}\]
  2. The gradient of the KL divergence with respect to \(\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i}\htmlClass{sdt-0000000011}{j}}\) is (the entropy of the fixed target distribution does not depend on the weights, so only the cross-entropy term contributes a derivative):
    \[\frac{\delta KL(P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}),P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = - \sum_{\htmlClass{sdt-0000000091}{\mathbf{s}}} P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}) \frac{\delta \log P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}})}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}}\]
  3. For a Boltzmann machine, the probability \(P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}})\) is:
    \[p(\htmlClass{sdt-0000000091}{\mathbf{s}}) = \frac{1}{\htmlClass{sdt-0000000077}{Z}} \exp\left\{ - \frac{ \htmlClass{sdt-0000000100}{E}(\htmlClass{sdt-0000000091}{\mathbf{s}}) }{ \htmlClass{sdt-0000000029}{T} } \right\}\]
  4. Since \(\log P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}) = -\htmlClass{sdt-0000000100}{E}(\htmlClass{sdt-0000000091}{\mathbf{s}})/\htmlClass{sdt-0000000029}{T} - \log \htmlClass{sdt-0000000077}{Z}\), its derivative with respect to \(\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}\) is:
    \(\frac{\delta \log P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}})}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = \frac{1}{\htmlClass{sdt-0000000029}{T}}\frac{\delta(- \htmlClass{sdt-0000000100}{E}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} - \frac{\delta \log \htmlClass{sdt-0000000077}{Z}}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}}\)
  5. Recall the definition of the energy of a specific state of a Boltzmann machine:
    \[\htmlClass{sdt-0000000100}{E}(\htmlClass{sdt-0000000091}{\mathbf{s}}) = -\htmlClass{sdt-0000000080}{\sum}_{\htmlClass{sdt-0000000018}{i} < \htmlClass{sdt-0000000011}{j}}\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}\htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}}\]
  6. The derivative of the energy function with respect to \(\htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}\) is:
    \(\frac{\delta \htmlClass{sdt-0000000100}{E}(\htmlClass{sdt-0000000091}{\mathbf{s}})}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = -\htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}}\htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}}\)
    Similarly, differentiating \(\log \htmlClass{sdt-0000000077}{Z}\) produces the same quantity averaged under the model distribution:
    \(\frac{\delta \log \htmlClass{sdt-0000000077}{Z}}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = \frac{1}{\htmlClass{sdt-0000000029}{T}} \sum_{\htmlClass{sdt-0000000091}{\mathbf{s}}} P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}) \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}}\)
  7. The average correlations under the target and model distributions are written as:
    \(\langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{target}} = \sum_{\htmlClass{sdt-0000000091}{\mathbf{s}}} P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}) \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}}\) and \(\langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}} = \sum_{\htmlClass{sdt-0000000091}{\mathbf{s}}} P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}) \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}}\)
  8. Substituting the derivatives from steps 4 and 6 into the gradient expression from step 2:
    \(\frac{\delta KL(P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}),P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = -\frac{1}{\htmlClass{sdt-0000000029}{T}}\sum_{\htmlClass{sdt-0000000091}{\mathbf{s}}} P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}})\left(\htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} - \langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\)
  9. Averaging over the target distribution (using \(\sum_{\htmlClass{sdt-0000000091}{\mathbf{s}}} P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}) = 1\)) gives:
    \[\frac{\delta KL(P_{target}(\htmlClass{sdt-0000000091}{\mathbf{s}}),P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}(\htmlClass{sdt-0000000091}{\mathbf{s}}))}{\delta \htmlClass{sdt-0000000092}{w}_{\htmlClass{sdt-0000000018}{i} \htmlClass{sdt-0000000011}{j}}} = - \frac{1}{\htmlClass{sdt-0000000029}{T}}\left(\langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{target}} - \langle \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000018}{i}} \htmlClass{sdt-0000000091}{\mathbf{s}}_{\htmlClass{sdt-0000000011}{j}} \rangle_{P_{\htmlClass{sdt-0000000059}{\mathbf{W}}}}\right)\]
    As required.
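The result can be sanity-checked numerically: for a small machine, evaluate the KL divergence directly and compare a central finite difference in \(w_{ij}\) against the analytic expression. A sketch under the assumption of exhaustive state enumeration (all names illustrative):

```python
import itertools
import math

def boltzmann(w, T, n):
    """p(s) = exp(-E(s)/T)/Z with E(s) = -sum_{i<j} w[i][j] s_i s_j."""
    states = list(itertools.product([-1, 1], repeat=n))
    e = {s: -sum(w[a][b] * s[a] * s[b]
                 for a in range(n) for b in range(a + 1, n)) for s in states}
    Z = sum(math.exp(-e[s] / T) for s in states)
    return {s: math.exp(-e[s] / T) / Z for s in states}

def kl(p, q):
    """KL(p, q) = sum_s p(s) log(p(s)/q(s))."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p)

def analytic_grad(p_target, p_model, T, i, j):
    """-(1/T) * (<s_i s_j>_target - <s_i s_j>_model)."""
    diff = sum((p_target[s] - p_model[s]) * s[i] * s[j] for s in p_target)
    return -diff / T

def numeric_grad(p_target, w, T, n, i, j, h=1e-6):
    """Central finite difference of KL(p_target, p_W) in w[i][j]."""
    def kl_at(delta):
        w2 = [row[:] for row in w]
        w2[i][j] += delta
        return kl(p_target, boltzmann(w2, T, n))
    return (kl_at(h) - kl_at(-h)) / (2 * h)
```

For arbitrary small weight matrices, the two gradients agree to within finite-difference error, confirming the derivation.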

References

  1. Jaeger, H. (n.d.). Neural Networks (AI) (WBAI028-05) Lecture Notes BSc program in Artificial Intelligence. Retrieved April 27, 2024, from https://www.ai.rug.nl/minds/uploads/LN_NN_RUG.pdf