In this blog post, we explore how some losses can be rewritten as Bayesian objectives using ideas from variational inference (hence the tongue-in-cheek “Bayesian Appropriation”). This can make it easier to see connections between loss functions and Bayesian methods, e.g. by spotting similar patterns in the wild. We will first provide a brief overview of the necessary background concepts, followed by step-by-step derivations. Overall, this post focuses on a rather simple observation and is mainly intended to be a straightforward reference for myself, but I hope it will be useful to others as well.
Background
In the following background section, we look at the Gibbs/Boltzmann distribution, the information-theoretic notation we are going to use, and variational inference. These concepts are essential to understanding the main topic of this blog post, so we briefly explain each of them to make the post accessible to readers who may not be familiar with them. As always, feel free to skim and skip as is best :-)
Gibbs/Boltzmann Distribution
The Gibbs or Boltzmann distribution is a probability distribution that describes the probability of a system being in a particular state, given its energy. It is connected to the principle of maximum entropy: among all distributions with a given average energy, the Gibbs distribution is the one with the highest entropy (or disorder). In simple terms, the Gibbs distribution helps us understand how systems tend to distribute their energy among various possible states. The discrete form of the Gibbs distribution is given by the formula:

$$p(x) = \frac{1}{Z} \, e^{-E(x)/(k_B T)}, \qquad Z = \sum_{x'} e^{-E(x')/(k_B T)},$$

where $E(x)$ is the energy of state $x$, $T$ is the temperature, $k_B$ is Boltzmann’s constant, and $Z$ is the partition function that normalizes the distribution.
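As a quick sanity check, here is a minimal sketch in plain NumPy (with a made-up energy vector) that computes the discrete Gibbs distribution as a softmax over negative energies:

```python
import numpy as np

def gibbs(energies, temperature=1.0):
    """Discrete Gibbs distribution p(x) ∝ exp(-E(x)/T) (with k_B = 1)."""
    logits = -np.asarray(energies) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum()  # divide by the partition function Z

energies = np.array([0.0, 1.0, 2.0])  # made-up energies for three states
print(gibbs(energies, temperature=1.0))   # low-energy states dominate
print(gibbs(energies, temperature=10.0))  # high temperature: near uniform
```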
Information-Theoretic Notation
Information theory deals with the quantification, storage, and communication of information. In this blog post, we use information-theoretic notation to express various quantities related to probability distributions and their relationships.

The information content of an event $x$ under a distribution $p(x)$ is:

$$-\log p(x).$$

The entropy is the expectation over the information content:

$$H[p] := \mathbb{E}_{p(x)}[-\log p(x)].$$

We will also use the Kullback-Leibler Divergence (🥬) and the cross-entropy:

$$D_{KL}(q \,\|\, p) := \mathbb{E}_{q(x)}\left[\log \frac{q(x)}{p(x)}\right], \qquad H(q \,\|\, p) := \mathbb{E}_{q(x)}[-\log p(x)] = H[q] + D_{KL}(q \,\|\, p).$$
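To pin down the definitions, here is a small sketch that computes these quantities for discrete distributions given as NumPy arrays:

```python
import numpy as np

def information_content(p, x):
    """Information content -log p(x) of outcome x, in nats."""
    return -np.log(p[x])

def entropy(p):
    """H[p] = E_p[-log p(x)]."""
    return -np.sum(p * np.log(p))

def kl_divergence(q, p):
    """D_KL(q || p) = E_q[log q(x) - log p(x)]."""
    return np.sum(q * (np.log(q) - np.log(p)))

def cross_entropy(q, p):
    """H(q || p) = E_q[-log p(x)] = H[q] + D_KL(q || p)."""
    return -np.sum(q * np.log(p))

q = np.array([0.7, 0.2, 0.1])
p = np.array([1 / 3, 1 / 3, 1 / 3])
assert np.isclose(cross_entropy(q, p), entropy(q) + kl_divergence(q, p))
print(information_content(p, 0), entropy(q))
```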
Variational Inference
To connect the Gibbs distribution to a Bayesian principle, we start by looking at variational inference.
Variational inference is a technique used in Bayesian statistics to approximate complex probability distributions, such as the posterior distribution of model parameters given observed data. The main idea behind variational inference is to find an approximate distribution that is close to the true distribution by minimizing a divergence measure (usually the Kullback-Leibler divergence) between the two. This is particularly useful when the true posterior is difficult, computationally expensive, or even infeasible to compute directly.
We can side-step computing the actual posterior by using Bayes’ law to express it via a prior distribution and a likelihood: let’s assume we have model parameters $\theta$ and observed data $\mathcal{D}$.

We are interested in performing Bayesian inference and learning about $\theta$ through the posterior $p(\theta \mid \mathcal{D})$, which we approximate with a variational distribution $q(\theta)$.

We will use a functional approach for now, which is to say, we ignore how $q(\theta)$ is parameterized and treat it as an arbitrary distribution we can optimize over.

Then, we are interested in:

$$\min_q D_{KL}(q(\theta) \,\|\, p(\theta \mid \mathcal{D})).$$

For variational inference, we apply Bayes’ law $p(\theta \mid \mathcal{D}) = p(\mathcal{D} \mid \theta)\, p(\theta) / p(\mathcal{D})$ to obtain:

$$D_{KL}(q(\theta) \,\|\, p(\theta \mid \mathcal{D})) = \mathbb{E}_{q(\theta)}[-\log p(\mathcal{D} \mid \theta)] + D_{KL}(q(\theta) \,\|\, p(\theta)) + \log p(\mathcal{D}),$$

where $D_{KL}(q(\theta) \,\|\, p(\theta))$ is the prior term (regularizer), $\mathbb{E}_{q(\theta)}[-\log p(\mathcal{D} \mid \theta)]$ is the (marginal) loss, and $p(\mathcal{D})$ is the evidence. Since the evidence does not depend on $q$, minimizing the first two terms is equivalent to minimizing the divergence to the posterior.
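To make the decomposition concrete, here is a tiny discrete toy check (all numbers made up) that verifies the identity numerically:

```python
import numpy as np

prior = np.array([0.3, 0.7])        # p(theta) over two hypotheses
likelihood = np.array([0.8, 0.1])   # p(D | theta), made-up values
evidence = np.sum(likelihood * prior)      # p(D)
posterior = likelihood * prior / evidence  # Bayes' law

q = np.array([0.5, 0.5])  # an arbitrary approximate posterior

def kl(a, b):
    return np.sum(a * (np.log(a) - np.log(b)))

lhs = kl(q, posterior)
rhs = np.sum(q * -np.log(likelihood)) + kl(q, prior) + np.log(evidence)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```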
Expressing Losses as a Bayesian Objective
The main idea of this blog post is to show that we can rewrite some loss functions and the corresponding optimization objectives into a more Bayesian form by introducing matching likelihoods or joint probabilities. In other words, we can represent the loss using probabilities, which might allow us to connect the loss function to Bayesian principles and variational inference. We do not go into the details of this here, but we list a few research works at the end that can be a starting point for a more serious exploration of this topic.
General Joint Loss
To demonstrate the idea, we first consider an entropy-regularized general objective of the form:

$$\min_q \mathbb{E}_{q(\theta)}[L(\theta)] - H[q(\theta)],$$

where $L(\theta)$ is a loss over the model parameters (it may bundle data-fit and regularization terms into one joint loss).

By defining a joint probability distribution that is proportional to the exponential of the negative loss function, we can express this objective as a probabilistic objective and find the optimal solution using variational inference as well.

Concretely, we solve for an equivalent (probabilistic) joint probability likelihood:

$$p(\theta) := \frac{1}{Z} e^{-L(\theta)}, \qquad Z := \int e^{-L(\theta)} \, d\theta.$$

Substituting $L(\theta) = -\log p(\theta) - \log Z$ into the objective, we get:

$$\mathbb{E}_{q(\theta)}[L(\theta)] - H[q(\theta)] = \mathbb{E}_{q(\theta)}[\log q(\theta) - \log p(\theta)] - \log Z = D_{KL}(q(\theta) \,\|\, p(\theta)) - \log Z.$$

Hence, after dropping the constant $-\log Z$, the objective is exactly a Kullback-Leibler divergence that we know how to minimize.
Optimal Solution
What does this tell us about the optimal distribution?

From above, we can read off the optimal solution as the distribution that drives the Kullback-Leibler divergence to zero, i.e. $q^*(\theta) = p(\theta)$. Specifically, we have:

$$q^*(\theta) = \frac{1}{Z} e^{-L(\theta)},$$

which is exactly a Gibbs distribution with the loss playing the role of the energy (and unit temperature).
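As a quick numeric check, the following sketch (discrete states with made-up losses) verifies that the Gibbs distribution attains a lower entropy-regularized objective than random competitors:

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=5)  # made-up per-state losses

def objective(q):
    """E_q[L] - H[q] for a discrete distribution q."""
    return np.sum(q * L) + np.sum(q * np.log(q))

q_star = np.exp(-L) / np.sum(np.exp(-L))  # Gibbs: q*(x) ∝ exp(-L(x))
for _ in range(1_000):
    q = rng.dirichlet(np.ones(5))  # random competitor distribution
    assert objective(q_star) <= objective(q) + 1e-9
print(objective(q_star))  # equals -log Z, the minimal value
```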
Aside: Invariance to Variable Transformation
Once we have mapped a loss to a probabilistic form, the parameterization of $\theta$ does not affect the value of the objective. This is because we can re-parameterize the Kullback-Leibler divergence without change in value: under an invertible transformation $\phi = f(\theta)$, the densities of $q$ and $p$ pick up the same Jacobian factor, which cancels inside the logarithm:

$$D_{KL}(q(\phi) \,\|\, p(\phi)) = \mathbb{E}_{q(\phi)}\left[\log \frac{q(\theta) \left|\det \tfrac{\partial \theta}{\partial \phi}\right|}{p(\theta) \left|\det \tfrac{\partial \theta}{\partial \phi}\right|}\right] = D_{KL}(q(\theta) \,\|\, p(\theta)).$$

The above optimization problem might attain a different value for a different parameterization if we kept the entropy-regularized form, because the differential entropy $H[q]$ is not invariant under a change of variables; the Kullback-Leibler divergence, however, is.
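To see this invariance concretely, here is a Monte Carlo sketch (using SciPy, with Gaussians for $q$ and $p$ pushed through $\phi = e^\theta$, which turns them into log-normals): both KL estimates agree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
q, p = stats.norm(0.0, 1.0), stats.norm(1.0, 2.0)

theta = q.rvs(size=100_000, random_state=rng)
kl_theta = np.mean(q.logpdf(theta) - p.logpdf(theta))

# Push both distributions through phi = exp(theta): they become log-normal.
q_phi = stats.lognorm(s=1.0, scale=np.exp(0.0))
p_phi = stats.lognorm(s=2.0, scale=np.exp(1.0))
phi = np.exp(theta)
kl_phi = np.mean(q_phi.logpdf(phi) - p_phi.logpdf(phi))

print(kl_theta, kl_phi)  # identical up to floating-point error
```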
Maximum Entropy Principle
This connection between loss functions and Bayesian objectives can also be interpreted through the lens of the maximum entropy principle—we already see this from the entropy regularization above.
We can find the optimal solution that minimizes the expected loss while maximizing the entropy of the approximate posterior distribution. This provides an alternative perspective on the relationship between loss functions and Bayesian principles.
Let’s look at a simpler objective:

$$\min_q \mathbb{E}_{q(\theta)}[L(\theta)] - H[q(\theta)],$$

whose optimal solution is the Gibbs distribution $q^*(\theta) \propto e^{-L(\theta)}$ from above.

Importantly, this also tells us two things for $q^*$:

- for fixed entropy $H[q] = H[q^*]$, this is also the minimum of the expected loss $\mathbb{E}_{q}[L(\theta)]$, and
- for fixed expected loss $\mathbb{E}_{q}[L(\theta)] = \mathbb{E}_{q^*}[L(\theta)]$, this is a maximum entropy distribution (maximum $H[q]$).

This simply follows from the optimality of $q^*$ for the combined objective: a competitor with the same entropy but lower expected loss (or with the same expected loss but higher entropy) would achieve a lower combined objective, contradicting that $q^*$ is optimal.

Another way to argue is that whenever we have an optimization objective that can be split into a “loss” and a “regularizer”, we can obviously swap the two: constraining one term while optimizing the other yields the same family of solutions, with the trade-off set by a Lagrange multiplier. A numeric illustration follows below.
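As the promised illustration, the following sketch (using scipy.optimize, with made-up losses and an arbitrary achievable target for the expected loss) maximizes the entropy subject to a fixed expected loss; the constrained optimum has Gibbs form at some temperature:

```python
import numpy as np
from scipy.optimize import minimize

L = np.array([0.5, 1.0, 2.0, 3.0])  # made-up per-state losses
target_loss = 1.2  # fix the expected loss E_q[L]

def neg_entropy(q):
    return np.sum(q * np.log(q))

result = minimize(
    neg_entropy,
    x0=np.full(4, 0.25),
    bounds=[(1e-9, 1.0)] * 4,
    constraints=[
        {"type": "eq", "fun": lambda q: np.sum(q) - 1.0},
        {"type": "eq", "fun": lambda q: np.sum(q * L) - target_loss},
    ],
)
q_maxent = result.x

# A Gibbs distribution q(x) ∝ exp(-L(x)/T) has log q affine in L, so the
# finite-difference slopes of log q against L should be (nearly) constant:
print(np.diff(np.log(q_maxent)) / np.diff(L))  # ≈ -1/T for some T
```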
General Likelihood
We can apply this approach to loss functions that only depend on inputs, together with a separate regularization term.

Specifically, we consider objectives of the form:

$$\min_q \mathbb{E}_{q(\theta)}\left[\sum_i L(x_i; \theta) + R(\theta)\right] - H[q(\theta)],$$

for data points $x_i$ with per-sample loss $L(x; \theta)$ and regularizer $R(\theta)$.

We can rewrite the objective as:

$$\min_q \mathbb{E}_{q(\theta)}\left[\sum_i L(x_i; \theta)\right] + \mathbb{E}_{q(\theta)}[R(\theta) + \log q(\theta)].$$

Using the joint loss $\sum_i L(x_i; \theta) + R(\theta)$ in the construction from the previous section, the optimal solution is again a Gibbs distribution $q^*(\theta) \propto e^{-\sum_i L(x_i; \theta) - R(\theta)}$.

However, this ‘generalized likelihood’ $e^{-L(x; \theta)}$ is not necessarily a probabilistic likelihood, and the prior needs a bit of care, too.

To see this, let’s follow the steps from the previous section to obtain the probabilistic form of the objective using a model with generalized likelihood $e^{-L(x; \theta)}$ and prior $p(\theta) := \frac{1}{Z_\theta} e^{-R(\theta)}$ with $Z_\theta := \int e^{-R(\theta)} \, d\theta$.

The equivalent Bayesian objective is then:

$$\min_q \mathbb{E}_{q(\theta)}\left[\sum_i L(x_i; \theta)\right] + D_{KL}(q(\theta) \,\|\, p(\theta)),$$

which matches the original objective up to the constant $-\log Z_\theta$.

Importantly, the prior $p(\theta)$ has to be a proper (normalizable) probability distribution for the Kullback-Leibler term to be well-defined, whereas $e^{-L(x; \theta)}$ does not have to integrate to one over $x$: it acts as a generalized likelihood in the sense of Bissiri et al. (see the papers below).
Additional notes:
- We need the entropy term $-H[q(\theta)]$ because of the $\mathbb{E}_{q(\theta)}[\log q(\theta)]$ contribution that is required for combining the prior distribution with the likelihood in variational inference: together with $\mathbb{E}_{q(\theta)}[-\log p(\theta)]$, it completes the Kullback-Leibler divergence to the prior.
- This is exactly the Boltzmann/Gibbs distribution with a temperature $T = 1/k_B$ that cancels out Boltzmann’s constant. To be fair, we could also define $k_B T = 1$ as the constant with $k_B$ being a conversion factor between a temperature (in Kelvin) and a unitless quantity.
- We use probability distributions as a tool to find the posterior distribution using an equivalent variational inference objective.
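As a concrete instance of this correspondence (a common special case, not spelled out above): a sum-of-squares loss with an L2 regularizer matches a unit-variance Gaussian likelihood with a zero-mean Gaussian prior whose precision is the weight decay. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=10)  # made-up 1D observations
weight_decay = 0.1

def regularized_loss(theta):
    """Sum of squared errors plus an L2 regularizer."""
    return 0.5 * np.sum((x - theta) ** 2) + 0.5 * weight_decay * theta**2

def neg_log_joint(theta):
    """-log p(D | theta) - log p(theta) for a unit-variance Gaussian
    likelihood and a zero-mean Gaussian prior with precision weight_decay."""
    nll = 0.5 * np.sum((x - theta) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)
    nlp = 0.5 * weight_decay * theta**2 + 0.5 * np.log(2 * np.pi / weight_decay)
    return nll + nlp

# The two objectives differ only by a constant independent of theta:
diffs = [neg_log_joint(t) - regularized_loss(t) for t in np.linspace(-2, 2, 5)]
assert np.allclose(diffs, diffs[0])
```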
MAP Solution
For a simpler objective, we can choose a Dirac distribution for the approximate posterior distribution and obtain a maximum a posteriori (MAP) solution. This is particularly useful for deterministic neural networks, as it allows us to convert a loss function with a regularization term into an equivalent Bayesian objective.
We can choose a Dirac distribution for $q(\theta) = \delta(\theta - \hat{\theta})$. The expectations then collapse to point evaluations, and (dropping the entropy of the Dirac, which is the same constant regardless of $\hat{\theta}$) the objective reduces to:

$$\min_{\hat{\theta}} -\log p(\mathcal{D} \mid \hat{\theta}) - \log p(\hat{\theta}),$$

whose minimizer is the maximum a posteriori estimate $\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta \mid \mathcal{D})$.
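Continuing the assumed Gaussian example from above, here is a minimal sketch: the minimizer of the regularized loss coincides with the closed-form MAP estimate available under conjugacy.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=10)  # same made-up data as above
weight_decay = 0.1

# MAP estimate = argmin of the regularized loss (negative log joint + const).
result = minimize_scalar(
    lambda t: 0.5 * np.sum((x - t) ** 2) + 0.5 * weight_decay * t**2
)

# Closed form for a Gaussian likelihood with a zero-mean Gaussian prior:
theta_map = np.sum(x) / (len(x) + weight_decay)
assert np.isclose(result.x, theta_map)
print(result.x, theta_map)
```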
Conclusion
In conclusion, we can construct Bayesian objectives by expressing losses as likelihoods and joint probability distributions. This allows us to leverage insights from variational inference to write down an optimal solution.
Keep in mind, however, that we do not have to use variational inference to find the optimal solution and that writing down the solution does not mean that we can actually compute it. Actually computing it would require computing the partition constants, too.
On the other hand, we don’t always need the partition constant or the normalized density. For example, we can use the unnormalized density to sample from the distribution.
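For instance, here is a minimal random-walk Metropolis-Hastings sketch (with a made-up quadratic loss, so that $q^*(\theta) \propto e^{-L(\theta)}$ is a standard Gaussian) that only ever evaluates the unnormalized density:

```python
import numpy as np

rng = np.random.default_rng(0)

def L(theta):
    return 0.5 * theta**2  # made-up loss; q*(theta) is a standard Gaussian

def metropolis_hastings(n_samples=50_000, step=1.0):
    samples, theta = [], 0.0
    for _ in range(n_samples):
        proposal = theta + step * rng.normal()
        # Accept with probability min(1, exp(-L(proposal)) / exp(-L(theta)));
        # the unknown partition constant Z cancels in this ratio.
        if rng.random() < np.exp(L(theta) - L(proposal)):
            theta = proposal
        samples.append(theta)
    return np.array(samples)

samples = metropolis_hastings()
print(samples.mean(), samples.std())  # ≈ 0 and 1 for this example
```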
It might also be helpful to know the form of these solutions to recognize them in other contexts and to use them as a starting point.
Please leave some feedback and send along more papers for the next section :)
More Serious Research Papers
Knoblauch, Jeremias, Jack Jewson, and Theodoros Damoulas. “Generalized variational inference: Three arguments for deriving new posteriors.” arXiv preprint arXiv:1904.02063 (2019). https://arxiv.org/abs/1904.02063
Wild, Veit David, Robert Hu, and Dino Sejdinovic. “Generalized variational inference in function spaces: Gaussian measures meet Bayesian deep learning.” Advances in Neural Information Processing Systems 35 (2022): 3716-3730. https://arxiv.org/abs/2205.06342
Kendall, Alex, Yarin Gal, and Roberto Cipolla. “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. https://arxiv.org/abs/1705.07115v3 https://openaccess.thecvf.com/content_cvpr_2018/html/Kendall_Multi-Task_Learning_Using_CVPR_2018_paper.html
Cobb, Adam D., Stephen J. Roberts, and Yarin Gal. “Loss-calibrated approximate inference in Bayesian neural networks.” arXiv preprint arXiv:1805.03901 (2018). https://arxiv.org/abs/1805.03901
Bissiri, Pier Giovanni, Chris C. Holmes, and Stephen G. Walker. “A general framework for updating belief distributions.” Journal of the royal statistical society. series b, statistical methodology 78.5 (2016): 1103. https://arxiv.org/abs/1306.6430 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5082587/
Ganev, Stoil, and Laurence Aitchison. “Semi-supervised learning objectives as log-likelihoods in a generative model of data curation.” arXiv preprint arXiv:2008.05913 (2020). https://arxiv.org/abs/2008.05913
Aitchison, Laurence. “InfoNCE is a variational autoencoder.” arXiv preprint arXiv:2107.02495 (2021). https://arxiv.org/abs/2107.02495
Acknowledgements. Thanks to Linara Adilova for valuable discussion & feedback, to Laurence Aitchison for spotting a mistake in the initial derivations and interesting discussions, and to Yarin Gal for pointing to additional related literature.