In this blog post, we explore how some losses could be rewritten as a Bayesian objective using ideas from variational inference—hence, the tongue-in-cheek “Bayesian Appropriation.” This can make it easier to see connections between loss functions and Bayesian methods (e.g. by spotting similar patterns in the wild). We will first provide a brief overview of the necessary background concepts, followed by step-by-step derivations. Overall, this post focuses on a rather simple observation and is mainly intended to be a straightforward reference for myself, but I hope it will be useful to others as well.
Background
In this background section, we look at the Gibbs/Boltzmann distribution, the information-theoretic notation we are going to use, and variational inference. These concepts are essential to understanding the main topic of this blog post, so we briefly explain each one for readers who may not be familiar with them. As always, feel free to skim and skip as is best :-)
Gibbs/Boltzmann Distribution
The Gibbs or Boltzmann distribution is a probability distribution that describes the probability of a system being in a particular state, given the energy of that state. It can be derived from the principle of maximum entropy: among all distributions with a given expected energy, it is the one with the highest entropy. In simple terms, the Gibbs distribution helps us understand how systems tend to distribute their energy among various possible states. The discrete form of the Gibbs distribution is given by the formula: \[ P(E) = \frac{e^{-E/kT}}{\sum_{i=1}^N e^{-E_i/kT}} \] where \(E\) is the energy of the state, \(E_1, \dots, E_N\) are the energies of the \(N\) possible states, \(k\) is Boltzmann’s constant, and \(T\) is the temperature. The denominator is a normalizing factor that ensures that the probabilities sum to 1.
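As a minimal sketch of the discrete formula above, the following computes Gibbs probabilities for a handful of states. The energy levels and the values of `kT` are made up for illustration; the function name `gibbs_distribution` is ours.

```python
import numpy as np

def gibbs_distribution(energies, kT=1.0):
    """Return P(E_i) = exp(-E_i / kT) / sum_j exp(-E_j / kT)."""
    energies = np.asarray(energies, dtype=float)
    # Shifting by the minimum energy leaves the ratios unchanged but avoids overflow.
    weights = np.exp(-(energies - energies.min()) / kT)
    return weights / weights.sum()

energies = [0.0, 1.0, 2.0]  # hypothetical energy levels
p_cold = gibbs_distribution(energies, kT=0.1)   # low temperature
p_hot = gibbs_distribution(energies, kT=10.0)   # high temperature
print(p_cold, p_hot)
```

At low temperature almost all probability mass sits on the lowest-energy state, while at high temperature the distribution approaches uniform.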
Information-Theoretic Notation
Information theory deals with the quantification, storage, and communication of information. In this blog post, we use information-theoretic notation to express various quantities related to probability distributions and their relationships. Some key quantities we will use include the information content of an event \(x\), the entropy of a random variable \(X\), and the Kullback-Leibler Divergence (🥬) between two probability distributions:
The information content of an event \(x\) is \(-\log \pof{x}\): \[\Hof{x} \triangleq \xHof{\pof{x}} \triangleq -\log \pof{x}.\] We commonly use this information content as a minimization objective in deep learning (as the negative log-likelihood, which is just the cross-entropy).
The entropy is the expectation over the information content: \[\Hof{X} = \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{\xICof{\pof{x}}} = \E{\pof{x}}{-\log \pof{x}},\] where we use the following convention (introduced in “A Practical & Unified Notation for Information-Theoretic Quantities in ML” (Kirsch et al, 2021)): when we have an expression with (upper-case) random variables and (lower-case) observations, we take an expectation over all the random variables conditional on all the given observations. For example, the joint entropy \(\Hof{X, y \given Z}\) is: \[\Hof{X, y \given Z}= \E{\pof{x,z \given y}}{\Hof{x,y \given z}}.\]
We will also use the Kullback-Leibler Divergence (🥬) and the cross-entropy: \[\begin{align}\Kale{\pof{X}}{\qof{X}} &= \CrossEntropy{\pof{X}}{\qof{X}} - \xHof{\pof{X}}\\ &= \E{\pof{x}}{\xICof{\qof{x}}} - \E{\pof{x}}{\xICof{\pof{x}}}\\ &= \E{\pof{x}}{-\log{\qof{x}}} - \E{\pof{x}}{-\log{\pof{x}}}. \end{align}\]
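The identity above can be checked numerically for discrete distributions. This is a small sketch; the two distributions `p` and `q` are arbitrary examples, and the helper functions assume strictly positive probabilities.

```python
import numpy as np

def entropy(p):
    # H[p] = E_p[-log p]
    return float(-(p * np.log(p)).sum())

def cross_entropy(p, q):
    # H[p || q] = E_p[-log q]
    return float(-(p * np.log(q)).sum())

def kl_divergence(p, q):
    # KL[p || q] = E_p[log p - log q]
    return float((p * (np.log(p) - np.log(q))).sum())

p = np.array([0.5, 0.3, 0.2])  # hypothetical distributions
q = np.array([0.4, 0.4, 0.2])

kl = kl_divergence(p, q)
difference = cross_entropy(p, q) - entropy(p)
print(kl, difference)
```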
Variational Inference
To connect the Gibbs distribution to a Bayesian principle, we start by looking at variational inference.
Variational inference is a technique used in Bayesian statistics to approximate complex probability distributions, such as the posterior distribution of model parameters given observed data. The main idea behind variational inference is to find an approximate distribution that is close to the true distribution by minimizing a divergence measure (usually the Kullback-Leibler divergence) between the two distributions. This is particularly useful when the true posterior distribution is difficult or computationally expensive to compute directly (or even infeasible to compute).
We can side-step computing the actual posterior by using Bayes’ law to express it via a prior distribution and a likelihood: let’s assume we have model parameters \(\w\) and an observation \(x\), together with a prior distribution \(\pof{\w}\) and a likelihood \(\pof{x \given \w}\).
We are interested in performing Bayesian inference and learning about \(\pof{\w \given x}\). We can do so by using a different parameterized distribution \(\qcof{\psi}{\w}\) and optimizing \(\psi\) to make \(\qcof{\psi}{\w}\) be as close to \(\pof{\w \given x}\) as possible.
We will use a functional approach for now, which is to say, we ignore \(\psi\) and use \(\qof{\w}\).
Then, we are interested in: \[ \arg\min_{\opq} \, \Kale{\qof{\W}}{\pof{\W \given x}} \] The 🥬 (like any divergence) is non-negative and equals 0 iff \(\qof{\w} = \pof{\w \given x}\), that is, when the two distributions match.
For variational inference, we apply Bayes’ law to obtain: \[ \begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given x}}=\Kale{\qof{\W}}{\frac{\pof{x \given \W}\pof{\W}}{\pof{x}}}\\ &\quad =\CrossEntropy{\qof{\W}}{\pof{x \given \W}} + \CrossEntropy{\qof{\W}}{\pof{\W}} \\ &\quad \quad - \xHof{\pof{x}} - \xHof{\qof{\W}}\\ &\quad = \CrossEntropy{\qof{\W}}{\pof{x \given \W}} + \Kale{\qof{\W}}{\pof{\W}} - \xHof{\pof{x}}. \end{aligned} \] This leads directly to the ELBO (though it is an upper-bound in information-theory): \[ \begin{align} &\Kale{\qof{\W}}{\pof{\W \given x}} \ge 0 \\ \iff & \CrossEntropy{\qof{\W}}{\pof{x \given \W}} + \Kale{\qof{\W}}{\pof{\W}} \ge \xHof{\pof{x}} \\ \iff & \E{\qof{\w}} {-\log \pof{x \given \w}} +\Kale{\qof{\W}}{\pof{\W}} \ge - \log \pof{x}, \end{align} \] where:
- \(\Kale{\qof{\W}}{\pof{\W}}\) is the prior term (regularizer),
- \(\E{\qof{\w}} {-\log \pof{x \given \w}}\) is the (marginal) loss, that is, the expected negative log-likelihood under \(\qof{\w}\), and
- \(-\log \pof{x}\) is the negative log evidence.
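To make the bound concrete, here is a small numerical sketch with a discrete parameter \(\w\) that takes three values. The prior and likelihood values are made up, and `bound` is our own helper name. The sketch checks that the loss plus prior term never falls below \(-\log \pof{x}\) and that the posterior attains equality.

```python
import numpy as np

rng = np.random.default_rng(0)
prior = np.array([0.2, 0.5, 0.3])        # hypothetical p(w)
likelihood = np.array([0.9, 0.1, 0.4])   # hypothetical p(x | w) for one fixed x

evidence = float((likelihood * prior).sum())   # p(x)
posterior = likelihood * prior / evidence      # p(w | x)

def bound(q):
    """E_q[-log p(x | w)] + KL[q(W) || p(W)]."""
    expected_nll = -(q * np.log(likelihood)).sum()
    kl_to_prior = (q * (np.log(q) - np.log(prior))).sum()
    return float(expected_nll + kl_to_prior)

# Evaluate the bound for many random candidate distributions q(w).
random_qs = rng.dirichlet(np.ones(3), size=1000)
bounds = np.array([bound(q) for q in random_qs])
print(bounds.min(), bound(posterior), -np.log(evidence))
```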
Expressing Losses as a Bayesian Objective
The main idea of this blog post is to show that we can rewrite some loss functions and the corresponding optimization objectives into a more Bayesian form by introducing matching likelihoods or joint probabilities. In other words, we can represent the loss using probabilities, which might allow us to connect the loss function to Bayesian principles and variational inference. We do not go into the details of this, but present a few research works that can be a starting point for a more serious exploration of this topic.
General Joint Loss
To demonstrate the idea, we first consider an entropy-regularized general objective of the form: \[ \arg\min_{\opq} \; \E{\qof{\w}}{L(x, \w)} - \xHof{\qof{\W}}, \] where we want to minimize a “general joint loss” \(L(x, \w)\) for observation \(x\) given model parameters \(\w\) while maximizing the entropy of the approximate distribution. \(L(x, \w)\) does not have to be normalized and does not have to be non-negative.
By defining a joint probability distribution that is proportional to the exponential of the negative loss function, we can express this objective as a probabilistic objective and find the optimal solution using variational inference as well.
Concretely, we look for a joint probability whose negative log matches the loss up to a constant: \(L(x, \w) \equiv -\log \pof{x, \w}\). We thus let: \[\pof{x, \w} \propto \exp(-L(x, \w)).\]With normalization constant \(Z_L = \int \exp(-L(x, \w)) \, dx \, d\w,\) we have:\[\pof{x, \w} = \frac{\exp(-L(x, \w))}{Z_L},\] and vice-versa: \[L(x, \w) = -\log \pof{x, \w} - \log Z_L.\]
Hence, after dropping the \(\log Z_L\) constant term, we observe the equivalent objective: \[ \begin{align} &\E{\qof{\w}}{-\log \pof{x, \w}} - \xHof{\qof{\W}} \\ &\quad \overset{(1)}{=} \underbrace{\E{\qof{\w}}{-\log \pof{\w \given x}}}_{\CrossEntropy{\qof{\W}}{\pof{\W \given x}}} - \xHof{\qof{\W}} - \log{\pof{x}} \\ &\quad \overset{(2)}{=} \Kale{\qof{\W}}{\pof{\W \given x}}- \log{\pof{x}}\\ &\quad \ge -\log{\pof{x}}, \end{align}\] where (1) follows from the chain rule (\(\pof{a,b} = \pof{a\given b} \, \pof{b}\)), and (2) is just the definition of the KL (🥬) divergence. This is following the same steps as variational inference.
Optimal Solution
What does this tell us about the optimal distribution?
From above, we can read off the optimal solution as \(\pof{\w\given x}\), which is the posterior distribution for appropriately defined \(\pof{\w}\) and \(\pof{x \given \w}:\) \[\begin{aligned} \pof{\w} &= \int \pof{x, \w} \, dx \\ \pof{x \given \w} &= \frac{\pof{x, \w}}{\pof{\w}}. \end{aligned}\] These terms might often be intractable analytically, but when we can solve them, we have a closed form for an equivalent variational inference problem.
Specifically, we have:\[\pof{\w \given x} = \frac{\pof{x, \w}}{\pof{x}} = \frac{\exp(-L(x, \w))}{\int \exp(-L(x, \w)) \, d\w}.\] Writing this without the normalization constant and keeping in mind that we have to normalize over \(\w\), we have: \[ \arg\min_\opq \; \E{\qof{\w}}{L(x, \w)} - \xHof{\qof{\W}} \propto \exp(-L(x, \w)). \]
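We can sanity-check this solution on a grid. The sketch below assumes a made-up loss \(L(x, \w)\) for one fixed \(x\) and replaces the integral over \(\w\) with a sum over grid points, so it illustrates the discrete analogue only. It compares the entropy-regularized objective at the normalized \(\exp(-L)\) with its value at random distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
w_grid = np.linspace(-3.0, 3.0, 61)
x = 1.0
# Hypothetical loss L(x, w); it is unnormalized and takes negative values.
loss = (w_grid - x) ** 2 + 0.1 * w_grid ** 4 - 2.0

def objective(q):
    """E_q[L(x, w)] - H[q(W)] for a distribution q over the grid."""
    mask = q > 0
    neg_entropy = (q[mask] * np.log(q[mask])).sum()
    return float((q * loss).sum() + neg_entropy)

# Candidate optimum: q*(w) proportional to exp(-L(x, w)), normalized over w.
q_star = np.exp(-(loss - loss.min()))
q_star /= q_star.sum()
log_partition = float(np.log(np.exp(-loss).sum()))

random_qs = rng.dirichlet(np.ones(len(w_grid)), size=500)
random_values = np.array([objective(q) for q in random_qs])
print(objective(q_star), -log_partition, random_values.min())
```

The optimal value equals the negative log of the normalizer over \(\w\), which plays the role of \(-\log \pof{x}\) up to the dropped constant.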
Aside: Invariance to Variable Transformation
Once we have mapped a loss to a probabilistic form, the parameterization of \(\w\) does not matter (assuming smooth reversible transformations).
This is because we can re-parameterize the Kullback-Leibler divergence without change in value: \[\Kale{\qof{X}}{\pof{X}} = \Kale{\qof{Y}}{\pof{Y}},\]when \(Y = f(X)\) and \(f\) is smooth and bijective.
The optimization problem above might attain a different value for \(-\log \pof{x}\), but the optimal solution for \(\opq\) is the same.
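A simple special case of this invariance can be verified in closed form: for Gaussians and an affine bijection \(Y = aX + b\), both distributions stay Gaussian. The sketch below uses the standard closed-form KL divergence between univariate Gaussians; the specific means, standard deviations, and the map are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL[N(mean_q, std_q^2) || N(mean_p, std_p^2)] in closed form."""
    return float(
        np.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
        - 0.5
    )

def affine_pushforward(mean, std, a, b):
    """Parameters of Y = a * X + b when X ~ N(mean, std^2) and a != 0."""
    return a * mean + b, abs(a) * std

q_x = (0.5, 1.2)    # hypothetical q(X): mean, standard deviation
p_x = (-1.0, 2.0)   # hypothetical p(X)
a, b = -3.0, 4.0    # bijective affine map

kl_x = gaussian_kl(*q_x, *p_x)
kl_y = gaussian_kl(*affine_pushforward(*q_x, a, b), *affine_pushforward(*p_x, a, b))
print(kl_x, kl_y)
```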
Maximum Entropy Principle
This connection between loss functions and Bayesian objectives can also be interpreted through the lens of the maximum entropy principle—we already see this from the entropy regularization above.
We can find the optimal solution that minimizes the expected loss while maximizing the entropy of the approximate posterior distribution. This provides an alternative perspective on the relationship between loss functions and Bayesian principles.
Let’s look at a simpler objective: \[ \arg\min_\opq \; \E{\qof{\w}}{L(\w)} - \xHof{\qof{\W}}. \] First, using the same approach, we see that \(\opq_* \propto \exp(-L(\w))\).
Importantly, this also tells us two things for \(\opq_*\):
- for fixed entropy \(\xHof{\qof{\W}}\), this is also the minimum of \(\E{\qof{\w}}{L(\w)};\) and
- for fixed \(\E{\qof{\w}}{L(\w)}\), this is a maximum entropy distribution (maximum \(\xHof{\qof{\W}}\)).
This simply follows from \(\opq_*\) minimizing the combined objective above: if there were a better value of either term while the other remained fixed, the combined objective would be lower, contradicting the optimality of \(\opq_*\).
Another way to argue is that whenever we have an optimization objective that can be split into a “loss” and a “regularizer”, we can obviously swap the two: \[ \arg\min_\opq \; \E{\qof{\w}}{L(\w)} - \xHof{\qof{\W}}=\arg\max_\opq \; \xHof{\qof{\W}} - \E{\qof{\w}}{L(\w)}. \] All of this is not rocket science, but worth pointing out nonetheless.
General Likelihood
We can apply this approach to simpler loss functions that only depend on inputs and a separate regularization term.
Specifically, we consider objectives of the form: \[ \arg\min_{\opq} \; \E{\qof{\w}}{L(x; \w)} + \Kale{\qof{\W}}{\pof{\W}}, \] where we use the notation \(L(x; \w)\) for the loss for observation \(x\) given model parameters \(\w\). \(L(x; \w)\) does not have to be normalized and does not have to be non-negative.
We can rewrite the objective as: \[\E{\qof{\w}}{L(x; \w)} + \Kale{\qof{\W}}{\pof{\W}} = \E{\qof{\w}}{L(x; \w) + \xHof{\pof{\w}}} - \xHof{\qof{\W}},\] by using the definition of the KL (🥬) divergence and cross-entropy.
Using the joint loss \(L'(x, \w) \triangleq L(x; \w) + \xHof{\pof{\w}}\), we immediately see from section “General Joint Loss” that we have the following equivalent objective with optimal solution: \[ \arg\min_{\opq} \; \E{\qof{\w}}{L(x; \w)} + \Kale{\qof{\W}}{\pof{\W}} \propto \exp(-L'(x, \w)) = \exp(-L(x; \w)) \, \pof{\w},\] where \(\exp(-L(x; \w))\) can be viewed as a ‘generalized likelihood’.
However, this ‘generalized likelihood’ is not necessarily a probabilistic likelihood, and the prior \(\pof{\w}\) is not necessarily the prior of an equivalent Bayesian objective.
To see this, let’s follow the steps from the previous section to obtain the probabilistic form of the objective using a model \(\ppof{x, \w}\): \[\begin{aligned} \ppof{x, \w} &= \frac{\exp(-L(x; \w)) \, \pof{\w}}{\int \exp(-L(x; \w)) \, \pof{\w} \, dx \,d\w} \\ \ppof{\w} &= \int \ppof{x, \w} \, dx \\ \ppof{x \given \w} &= \frac{\ppof{x, \w}}{\ppof{\w}}. \end{aligned}\]
The equivalent Bayesian objective is then: \[ \begin{aligned} &\arg\min_{\opq} \; \E{\qof{\w}}{L(x; \w)} + \Kale{\qof{\W}}{\pof{\W}} \\ &\quad = \arg\min_{\opq} \; \E{\qof{\w}}{-\log \ppof{x \given \w}} + \Kale{\qof{\W}}{\ppof{\W}}, \end{aligned} \] with the posterior \(\ppof{\w \given x} \propto \exp(-L(x; \w)) \, \pof{\w}\) as optimal solution.
Importantly, the prior \(\ppof{\w}\) might not be equal to \(\pof{\w}\) from the original objective. Substituting the definitions, we have: \[\ppof{\w} \propto \int \exp(-L(x; \w)) \, dx \, \pof{\w},\] which is only \(= \pof{\w}\) when the integral \(\int \exp(-L(x; \w)) \, dx\) is constant (independent of \(\w\)).
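The difference between the two priors is easy to see in a small discrete example. In this sketch, both \(x\) and \(\w\) take three values, the loss table and the prior are made up, and `implied_prior` is our own helper name. We compare an unnormalized loss with a version of the same loss that is a proper negative log-likelihood, so that the sum over \(x\) of \(\exp(-L)\) equals 1 for every \(\w\).

```python
import numpy as np

w_values = np.array([0.5, 1.0, 2.0])
x_values = np.array([0, 1, 2])
prior = np.array([0.2, 0.5, 0.3])   # hypothetical p(w) from the original objective

def implied_prior(loss_table):
    """Marginal p'(w) of the joint proportional to exp(-L(x; w)) * p(w).

    loss_table[i, j] holds L(x_j; w_i).
    """
    joint = np.exp(-loss_table) * prior[:, None]
    joint /= joint.sum()
    return joint.sum(axis=1)

# Unnormalized loss: the sum over x of exp(-L) depends on w.
unnormalized_loss = w_values[:, None] * (x_values[None, :] - 1) ** 2

# Same loss shifted per w so that exp(-L) sums to 1 over x (a proper NLL).
log_normalizer = np.log(np.exp(-unnormalized_loss).sum(axis=1, keepdims=True))
normalized_loss = unnormalized_loss + log_normalizer

print(implied_prior(unnormalized_loss), implied_prior(normalized_loss), prior)
```

Only the normalized loss recovers the original prior; the unnormalized one reweights it by the per-\(\w\) normalizer.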
Additional notes:
- We need the \(\exp(-\dots)\) because of the \(-\log \dots\) that is required for combining the prior distribution with the likelihood in variational inference.
- This is exactly the Boltzmann/Gibbs distribution with a temperature that cancels out Boltzmann’s constant (\(kT = 1\)). To be fair, we could also define \(kT\) as the constant with \(k\) being a conversion factor between a temperature (in Kelvin) and a unitless quantity.
- We use probability distributions as a tool to find the posterior distribution using an equivalent variational inference objective.
MAP Solution
For a simpler objective, we can choose a Dirac distribution for the approximate posterior distribution and obtain a maximum a posteriori (MAP) solution. This is particularly useful for deterministic neural networks, as it allows us to convert a loss function with a regularization term into an equivalent Bayesian objective.
Concretely, we set \(\qcof{\psi}{\w} = \delta_\psi(\w)\). Treating the entropy of the Dirac distribution as a constant that does not depend on \(\psi\), we obtain the simpler objective: \[\arg\min_\psi \; L(x; \psi) + \xHof{\pof{\psi}} = \arg\min_\psi \; L(x; \psi) - \log \pof{\psi}.\] Therefore, we can convert a loss for a deterministic neural network of the form: \[\arg\min_\psi \; L(x; \psi) + R(\psi)\]into an equivalent “Bayesian” objective by setting: \[\begin{aligned} \pof{x \given \w} &\propto \exp(-L(x; \w)) \\ \pof{\w} &\propto \int \exp(-L(x; \w)) \,dx \, \exp(-R(\w)). \end{aligned}\] For example, if \(R(\psi) = \psi^2\) (L2 regularization) and \(\int \exp(-L(x; \w)) \,dx\) is constant, we can recognize \(\pof{\w} \propto \exp(-\w^2) = \exp(-\frac{(\w-0)^2}{2 \cdot \frac{1}{2}})\) as a Gaussian distribution with mean 0 and variance \(\tfrac{1}{2}\); the choice \(R(\psi) = \tfrac{1}{2}\psi^2\) would give a unit Gaussian.
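As a quick numerical sketch of this correspondence, assume a scalar parameter, a squared-error loss \(L(x; \psi) = (x - \psi)^2\), and \(R(\psi) = \psi^2\). These choices and the grid are ours, for illustration only. Minimizing the regularized loss and maximizing the unnormalized posterior should pick the same grid point, and we can also compute the variance of the normalized \(\exp(-R(\psi))\) numerically.

```python
import numpy as np

x = 1.5                                      # hypothetical observation
psi_grid = np.linspace(-5.0, 5.0, 10001)
d_psi = psi_grid[1] - psi_grid[0]

loss = (x - psi_grid) ** 2                   # assumed L(x; psi)
regularizer = psi_grid ** 2                  # R(psi) = psi^2

# Deterministic view: minimize loss + regularizer.
psi_min_loss = psi_grid[np.argmin(loss + regularizer)]

# "Bayesian" view: maximize exp(-L) * exp(-R), the unnormalized posterior.
unnormalized_posterior = np.exp(-loss) * np.exp(-regularizer)
psi_map = psi_grid[np.argmax(unnormalized_posterior)]

# Variance of the density proportional to exp(-psi^2), via a Riemann sum.
prior = np.exp(-regularizer)
prior /= prior.sum() * d_psi
prior_variance = float((psi_grid ** 2 * prior).sum() * d_psi)

print(psi_min_loss, psi_map, prior_variance)
```

For this quadratic example, the minimizer can also be found by hand: setting the derivative \(-2(x - \psi) + 2\psi\) to zero gives \(\psi = x/2\).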
Conclusion
In conclusion, we can construct Bayesian objectives by expressing losses as likelihoods and joint probability distributions. This allows us to leverage insights from variational inference to write down an optimal solution.
Keep in mind, however, that we do not have to use variational inference to find the optimal solution and that writing down the solution does not mean that we can actually compute it. Actually computing it would require computing the partition constants, too.
On the other hand, we don’t always need the partition constant or the normalized density. For example, we can use the unnormalized density to sample from the distribution.
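One standard way to do this is a random-walk Metropolis sampler, which only needs the loss (the negative log of the unnormalized density) because the partition constant cancels in the acceptance ratio. The following is a minimal sketch rather than a tuned implementation; the loss \(L(\w) = \w^2\), the step size, and the function name `metropolis_sample` are our assumptions.

```python
import numpy as np

def metropolis_sample(neg_log_density, n_samples, step_size=1.0, w_init=0.0, seed=0):
    """Random-walk Metropolis for a scalar density proportional to exp(-neg_log_density(w))."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    w = w_init
    current = neg_log_density(w)
    for i in range(n_samples):
        proposal = w + step_size * rng.normal()
        proposed = neg_log_density(proposal)
        # Accept with probability min(1, exp(current - proposed)).
        # The normalization constant never appears: it cancels in the ratio.
        if np.log(rng.uniform()) < current - proposed:
            w, current = proposal, proposed
        samples[i] = w
    return samples

def loss(w):
    # Hypothetical loss; exp(-loss) is an unnormalized Gaussian with variance 1/2.
    return w ** 2

samples = metropolis_sample(loss, n_samples=20_000)
print(samples.mean(), samples.var())
```

The sample mean and variance should land close to 0 and 1/2, up to Monte Carlo error and the usual caveats about burn-in and autocorrelation.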
It might also be helpful to know the form of these solutions to recognize them in other contexts and to use them as a starting point.
Please leave some feedback and send along more papers for the next section :)
More Serious Research Papers
Knoblauch, Jeremias, Jack Jewson, and Theodoros Damoulas. “Generalized variational inference: Three arguments for deriving new posteriors.” arXiv preprint arXiv:1904.02063 (2019). https://arxiv.org/abs/1904.02063
Wild, Veit David, Robert Hu, and Dino Sejdinovic. “Generalized variational inference in function spaces: Gaussian measures meet bayesian deep learning.” Advances in Neural Information Processing Systems 35 (2022): 3716-3730. https://arxiv.org/abs/2205.06342
Kendall, Alex, Yarin Gal, and Roberto Cipolla. “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. https://arxiv.org/abs/1705.07115v3 https://openaccess.thecvf.com/content_cvpr_2018/html/Kendall_Multi-Task_Learning_Using_CVPR_2018_paper.html
Cobb, Adam D., Stephen J. Roberts, and Yarin Gal. “Loss-calibrated approximate inference in Bayesian neural networks.” arXiv preprint arXiv:1805.03901 (2018). https://arxiv.org/abs/1805.03901
Bissiri, Pier Giovanni, Chris C. Holmes, and Stephen G. Walker. “A general framework for updating belief distributions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78.5 (2016): 1103. https://arxiv.org/abs/1306.6430 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5082587/
Ganev, Stoil, and Laurence Aitchison. “Semi-supervised learning objectives as log-likelihoods in a generative model of data curation.” arXiv preprint arXiv:2008.05913 (2020). https://arxiv.org/abs/2008.05913
Aitchison, Laurence. “InfoNCE is a variational autoencoder.” arXiv preprint arXiv:2107.02495 (2021). https://arxiv.org/abs/2107.02495
Acknowledgements. Thanks to Linara Adilova for valuable discussion & feedback, to Laurence Aitchison for spotting a mistake in the initial derivations and interesting discussions, and to Yarin Gal for pointing to additional related literature.