Why is the Bayesian model average (BMA) often hailed as the optimal choice for rational actors making predictions under uncertainty? Is this claim justified, and if so, what’s the underlying logic?

No seriously, why? 😂 Please point me to the right references on this.

Below, I enumerate some naive thoughts on this question. I will argue in the affirmative and sketch a simple information-theoretic argument.

Concretely, we’ll dive deep into the information-theoretic foundations of BMA optimality. We’ll explore how mutual information between model parameters and predictions forms a lower bound on the average loss gap, and introduce the concept of “reverse mutual information” to gain further insights.

Using a combination of mathematical rigor and intuitive explanations, we’ll examine:

  1. The setting and assumptions behind BMA
  2. The expected loss gap and its implications
  3. The role of mutual information and its reverse counterpart
  4. How BMA compares to other possible hypotheses

(Note that none of this will tell us anything about the true loss gap, which would require knowing the actual data generating process.)

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\hpof}[1]{\hat{\opp}(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\hqof}[1]{\hat{\opq}(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\Dany}{\mathcal{D}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\L}{\boldsymbol{L}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} \newcommand{\normaldist}[1]{\mathcal{N}(#1)} \newcommand{\sgn}{\operatorname{sgn}} $$
  1. Setting. Let’s assume we have a set of hypotheses \(\W\) and we assign a prior belief \(\pof{\w}\) to each of them. Each hypothesis allows us to make predictions \(\Y\) using a likelihood \(\pof{\y \given \w}\), and the Bayesian model average (BMA) is the prediction \(\pof{\y} = \E{\pof{\w}}{\pof{\y \given \w}}\).
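
    For concreteness, here is a minimal sketch of this setting as a hypothetical discrete toy example (numpy; the names `prior`, `lik`, and `bma` are mine and nothing standard):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy setting: 4 hypotheses w with categorical likelihoods over 5 outcomes y.
    prior = rng.dirichlet(np.ones(4))        # prior belief p(w)
    lik = rng.dirichlet(np.ones(5), size=4)  # likelihoods p(y | w), one row per hypothesis

    # Bayesian model average: p(y) = E_{p(w)}[p(y | w)], an arithmetic mixture of the rows.
    bma = prior @ lik
    assert np.isclose(bma.sum(), 1.0)
    ```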

  2. Well-Specified. We believe that the true data-generating distribution can be represented by one of the hypotheses.

  3. Rational. Further, assume that our prior beliefs are rational; that is, if we had reason to believe that they were not consistent with our knowledge of the world so far, we would have chosen different prior beliefs \(\pof{\w}\).

  4. Expected Loss Gap. We measure the difference between two distributions using the (forward) Kullback-Leibler (KL) divergence \(\Kale{\pof{\cdot}}{\qof{\cdot}}\). It is defined as \[\Kale{\pof{\cdot}}{\qof{\cdot}} = \CrossEntropy{\pof{\cdot}}{\qof{\cdot}} - \xHof{\pof{\cdot}} \ge 0,\] where \(\CrossEntropy{\pof{\cdot}}{\qof{\cdot}}\) is the cross-entropy between \(\pof{\cdot}\) and \(\qof{\cdot}\); the KL divergence is \(0\) if and only if \(\pof{\cdot} = \qof{\cdot}\). The notation for the cross-entropy is not standard, but I find it much better than all the alternatives I have seen (and it aligns well with the notation for the KL divergence).

    The KL divergence tells us how much worse \(\qof{\cdot}\) is compared to \(\pof{\cdot}\) in terms of expected loss, assuming that \(\pof{\cdot}\) is the data-generating distribution. Thus, as a rational actor with a well-specified model, for any given hypothesis \(\qof{\cdot}\), the expected loss gap is: \[ \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\qof{\Y}}}. \]

    At the same time, if we are interested in the expected loss, we would compute the cross-entropy between the (expected) data-generating distribution \(\pof{\y}\) and the hypothesis \(\qof{\y}\): \[ \begin{aligned} \E{\pof{\w}}{\CrossEntropy{\pof{\Y \given \w}}{\qof{\Y}}} &= \CrossEntropy{\pof{\Y}}{\qof{\Y}} \\ &= \Kale{\pof{\Y}}{\qof{\Y}} + \xHof{\pof{\Y}} \\ &\ge \xHof{\pof{\Y}}. \end{aligned} \] Here, we make use of the fact that the cross-entropy is linear in its first argument (because that’s the expectation), and we end up with the BMA of the predictions \(\pof{\y}.\)
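
    Here is a minimal numerical sketch of these two facts, the “cross-entropy minus entropy” identity and the linearity of the cross-entropy in its first argument, on a hypothetical discrete toy example (numpy; the helper names are mine):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def entropy(p):
        return -np.sum(p * np.log(p))

    def cross_entropy(p, q):
        return -np.sum(p * np.log(q))

    def kl(p, q):
        return cross_entropy(p, q) - entropy(p)

    # Hypothetical toy example: 4 hypotheses w over 5 outcomes y.
    prior = rng.dirichlet(np.ones(4))        # p(w)
    lik = rng.dirichlet(np.ones(5), size=4)  # p(y | w)
    bma = prior @ lik                        # p(y), the BMA
    q = rng.dirichlet(np.ones(5))            # some alternative prediction q(y)

    # KL(p || q) = H(p || q) - H(p) is non-negative and zero iff p == q.
    assert kl(bma, q) > 0.0 and np.isclose(kl(q, q), 0.0)

    # The cross-entropy is linear in its first argument, so the expected loss
    # E_{p(w)}[H(p(Y | w) || q(Y))] collapses to H(p(Y) || q(Y)) with the BMA p(Y).
    lhs = sum(pw * cross_entropy(py_w, q) for pw, py_w in zip(prior, lik))
    assert np.isclose(lhs, cross_entropy(bma, q))
    ```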

  5. Choosing a Prediction. From this, we can already conclude that the BMA \(\pof{\y}\) is the optimal prediction compared to any other possible hypothesis \(\qof{\y}\) (given the assumptions above). More specifically, we can easily show that: \[ \begin{aligned} 0 \le \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\qof{\Y}}} &= \E{\pof{\w}}{\CrossEntropy{\pof{\Y \given \w}}{\qof{\Y}}} - \E{\pof{\w}}{\xHof{\pof{\Y \given \w}}} \\ &= \CrossEntropy{\pof{\Y}}{\qof{\Y}} - \xHof{\pof{\Y \given \W}} \\ &= \Kale{\pof{\Y}}{\qof{\Y}} + \xHof{\pof{\Y}} - \xHof{\pof{\Y \given \W}} \\ &= \Kale{\pof{\Y}}{\qof{\Y}} + \MIof{\Y ; \W} \\ &\ge \MIof{\Y ; \W}. \end{aligned} \] From \(\Kale{\pof{\Y}}{\qof{\Y}} \ge 0\), we see that the inequality gap is minimal when \(\qof{\Y} = \pof{\Y}\), in which case the expected BMA loss gap is just the mutual information between the predictions and the model parameters. (The above derivation follows from the definition of the Kullback-Leibler divergence as “cross-entropy - entropy” and the definition of the mutual information as \(\MIof{\Y ; \W} = \Hof{\Y} - \Hof{\Y \given \W}\).)
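
    As a quick sanity check, here is a minimal numerical sketch of this identity on a hypothetical discrete toy example (numpy; the names are mine): for many random alternatives \(\qof{\y}\), the expected loss gap equals \(\Kale{\pof{\Y}}{\qof{\Y}} + \MIof{\Y ; \W}\) and never drops below the mutual information.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        return np.sum(p * (np.log(p) - np.log(q)))

    prior = rng.dirichlet(np.ones(4))        # p(w)
    lik = rng.dirichlet(np.ones(5), size=4)  # p(y | w)
    bma = prior @ lik                        # p(y), the BMA

    def expected_gap(q):
        """E_{p(w)}[KL(p(Y | w) || q(Y))], the expected loss gap of predicting q."""
        return sum(pw * kl(py_w, q) for pw, py_w in zip(prior, lik))

    mi = expected_gap(bma)  # equals I[Y; W] = E_{p(w)}[KL(p(Y | w) || p(Y))]

    # For any alternative q(Y), the gap decomposes as KL(p(Y) || q(Y)) + I[Y; W],
    # so the BMA attains the minimum and the mutual information is the floor.
    for _ in range(1000):
        q = rng.dirichlet(np.ones(5))
        assert np.isclose(expected_gap(q), kl(bma, q) + mi)
        assert expected_gap(q) >= mi >= 0.0
    ```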

  6. Doubly Expected Loss Gap. We can also consider picking the hypothesis according to our prior beliefs \(\pof{\w}\) and measuring how large the expected loss gap is between all possible hypotheses depending on which one turns out to be true. This leads to the following doubly expected loss gap: \[ \simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}, \] which, with the same deduction as in §5, we can simplify to: \[ \begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}=\\ &\quad= \ldots\\ &\quad=\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} + \MIof{\Y ; \W} \\ &\quad\ge \MIof{\Y ; \W} \ge 0, \end{aligned} \] with equality if and only if \(\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} = 0\). We can rewrite the expected KL divergence between the BMA and the hypotheses as: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad = \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} \\ \end{aligned} \] with \[ \hpof{\y} \propto \exp(\E{\pof{\w'}}{\log \pof{\y \given \w'}}) \left( = \lim_{n \to \infty} \sqrt[n]{\prod_{\w'_i \sim \pof{\w'}} \pof{\y \given \w'_i}} \right), \] where we can also express “\(\exp \opExpectation \log\)” as a limit of \(\sqrt[n]{\prod \cdot}\) of tempered products to make it more intuitive: while the BMA corresponds to an arithmetic average of individual probabilities, \(\hpof{\y}\) is a geometric average of individual probabilities. See the appendix below for a derivation following the notation in this post, or Heskes (1998) 1 for a more detailed discussion. (A numerical sketch after this list checks this decomposition.)

    1. Equality. This tells us that equality requires \(\hpof{\y} = \pof{\y}\) and \(\hpof{\y} = \pof{\y \given \w'}\) for almost all \(\w'\), i.e., all hypotheses with prior mass make the same predictions; in essence, our beliefs have collapsed to a single hypothesis. But then \(\MIof{\Y ; \W} = 0\) as well, and the expected loss gap is \(0\). Otherwise, the inequalities are all strict.

    2. Disjunctions and Conjunctions. In other words, \(\hpof{\y}\) is a conjunction of all possible hypotheses (exponential mixture, “AND-projection”) instead of a disjunction of all possible hypotheses like the BMA (linear mixture, “OR-projection”). See Pedro Ortega’s excellent blog post “And, Or, and the Two KL Projections” for intuitions.

    3. Doubly Expected Loss. Similarly, we can show for the doubly expected loss: \[ \begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\CrossEntropy{\pof{\Y \given \w}}{\pof{\Y \given \w'}}} = \\ &\quad = \ldots \\ &\quad = \E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} + \xHof{\pof{\Y}} \\ &\quad = \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} + \Kale{\pof{\Y}}{\hpof{\Y}} + \xHof{\pof{\Y}} \\ &\quad \ge \xHof{\pof{\Y}} \ge 0. \end{aligned} \]
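
    As referenced above, here is a minimal numerical sketch of the decomposition on a hypothetical discrete toy example (numpy; the variable names `prior`, `lik`, `bma`, and `geo` are mine): it checks that the doubly expected loss gap equals the bias term plus the reverse mutual information plus the mutual information.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        return np.sum(p * (np.log(p) - np.log(q)))

    prior = rng.dirichlet(np.ones(4))        # p(w)
    lik = rng.dirichlet(np.ones(5), size=4)  # p(y | w)
    bma = prior @ lik                        # arithmetic mixture p(y)

    # Exponential (geometric) mixture: p̂(y) ∝ exp E_{p(w')}[log p(y | w')].
    geo = np.exp(prior @ np.log(lik))
    geo /= geo.sum()

    mi = sum(pw * kl(py_w, bma) for pw, py_w in zip(prior, lik))   # I[Y; W]
    rmi = sum(pw * kl(geo, py_w) for pw, py_w in zip(prior, lik))  # reverse MI
    bias = kl(bma, geo)                                            # KL(p(Y) || p̂(Y))

    # Doubly expected loss gap: E_{p(w')} E_{p(w)}[KL(p(Y | w) || p(Y | w'))].
    double_gap = sum(
        pw1 * pw * kl(py_w, py_w1)
        for pw1, py_w1 in zip(prior, lik)
        for pw, py_w in zip(prior, lik)
    )
    assert np.isclose(double_gap, bias + rmi + mi)
    assert double_gap >= mi >= 0.0
    ```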

  7. Interpretation:

    1. This analysis shows that, according to our prior beliefs, the expected loss gap between the true model of the world \(\pof{\y \given \w}\) (whichever hypothesis it turns out to be) and any hypothesis \(\qof{\cdot}\) we might compare it to is at least the mutual information \(\MIof{\Y ; \W}\), and that this lower bound is attained exactly by the BMA \(\pof{\y}\). This highlights the BMA’s optimality, but it is important to note that this optimality is contingent on our prior beliefs being accurately represented by \(\pof{\w}\), see §2 and §3.

    2. The expected loss when simply using the BMA instead of picking a specific hypothesis is thus never higher.

      In other words, in expectation, the BMA performs at least as well as any single hypothesis.

    3. We can characterize the expected loss gap between any two hypotheses given our prior beliefs using a mutual information term, a reverse mutual information \(\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}\) (which I believe is novel), and a bias term \(\Kale{\pof{\Y}}{\hpof{\Y}}\): \[\begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}=\\ &\quad= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} + \MIof{\Y ; \W}. \end{aligned} \] I call the latter a bias term because it does not depend on any specific hypothesis \(\w\) at all.

  8. Reverse Mutual Information. “Reverse mutual information” is a good name for \[\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}\] since the mutual information can be written as \[\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y}}},\] and the reverse mutual information swaps the linear mixture \(\pof{\y}\) for the exponential mixture \(\hpof{\y}\) and the forward KL divergence for the reverse KL divergence. We can further motivate this:

    1. (R)MI as a Minimization Problem. To clarify these terms further, we can express the mutual information as a minimization problem using the forward KL divergence (addendum: Pedro Ortega pointed me to Haussler and Opper (1997)2 as an early reference for the following): \[ \MIof{\Y ; \W} = \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y}}} = \min_{\qof{\y}} \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\qof{\Y}}}. \] Similarly, for the reverse mutual information, which uses the reverse KL divergence instead of the forward KL divergence, we can show: \[ \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} = \min_{\qof{\y}} \E{\pof{\w'}}{\Kale{\qof{\Y}}{\pof{\Y \given \w'}}}. \] (Both minimization properties are spot-checked in the numerical sketch at the end of this section.)

    2. Pseudo-RMI. It is instructive to contrast the RMI with a different term: the “pseudo-reverse mutual information” \[\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}},\] (previously examined in Malinin and Gales (2020) 3) and ask why it is not a good contender for a “reverse mutual information”. As previously shown, it decomposes into a bias term plus the reverse mutual information: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} = \\ &\quad= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}, \end{aligned} \] and while we can express both mutual information and reverse mutual information as minimization problems, this is not true for the pseudo-reverse mutual information. Thus, we see that the pseudo-reverse mutual information lacks the nice properties of the reverse mutual information.

    3. Finite RMI. Additionally, the pseudo-reverse mutual information is not always guaranteed to be finite, whereas the regular mutual information does not suffer from this problem, and neither does the reverse mutual information! Let’s examine this.

      The KL divergence \(\Kale{\pof{\cdot}}{\qof{\cdot}}\) is finite only if \(\qof{\cdot}\) has support (\(>0\)) wherever \(\pof{\cdot}\) does; otherwise, it is infinite. We can trivially construct examples where \(\pof{\y \given \w} = 0\) but \(\pof{\y} > 0\) for some \(\y\) and \(\w\), and thus \[\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} = \infty.\]

      On the other hand, the RMI is always finite: intuitively, if \(\pof{\y \given \w} = 0\) for some \(\w\) with positive prior mass, then \(\hpof{\y} = 0\) as well, since a geometric average that contains a single \(0\) factor vanishes overall.

      Formally, the minimization perspective tells us that the RMI is bounded by the objective of any other distribution: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} = \\ &\quad= \min_{\qof{\y}} \E{\pof{\w'}}{\Kale{\qof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad \le \E{\pof{\w'}}{\Kale{U(\Y)}{\pof{\Y \given \w'}}}, \end{aligned} \] where we can choose \(U(\cdot)\) to be the uniform distribution for bounded \(\Y\) or a Gaussian for unbounded \(\Y\). Whenever the likelihoods assign positive probability wherever \(U(\cdot)\) does (e.g., softmax or Gaussian likelihoods), this upper bound is finite, and so is the RMI; in the zero-probability examples constructed above, the intuitive argument still applies because \(\hpof{\y}\) vanishes exactly where any likelihood with prior mass does.

    4. Another Perspective. We can further rewrite the RMI as (where I use \(\Hof{\pof{x}} = -\log \pof{x}\) when there is nothing to take an expectation over): \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} = \\ &\quad = \E{\hpof{\y}\,\pof{\w'}}{\xHof{\frac{\pof{\y \given \w'}}{\hpof{\y}}}} \\ &\quad = \E{\hpof{\y}\,\pof{\w'}}{\xHof{\frac{\pof{\y, \w'}}{\hpof{\y}\,\pof{\w'}}}} \\ &\quad = \Kale{\hpof{\Y}\,\pof{\W'}}{\pof{\Y, \W'}}, \end{aligned} \] which is again very similar to the mutual information, which can be written as: \[ \begin{aligned} \MIof{\Y ; \W} = \Kale{\pof{\Y, \W}}{\pof{\Y} \, \pof{\W}}. \end{aligned} \] Compared to the mutual information, the RMI swaps the two arguments of the KL divergence and replaces the marginal \(\pof{\Y}\) with the exponential mixture \(\hpof{\Y}\).

    5. Simple Lower Bound. Finally using Jensen’s inequality, we find a lower bound on the RMI as \[ \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} \ge \Kale{\hpof{\Y}}{\pof{\Y}} \ge 0. \]
      In the appendix, I show how this leads to a further decomposition of the above RMI (which needs to be investigated further in my opinion).
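
    To make these claims concrete, here is a minimal numerical sketch on a hypothetical discrete toy example (numpy; the helper names `avg_forward_kl`, `avg_reverse_kl`, etc. are my own, not standard terminology). It spot-checks the two minimization properties, the pseudo-RMI decomposition, the product-versus-joint KL form of the RMI, the Jensen lower bound, and the finiteness behaviour under support mismatch:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        """KL(p || q) for discrete distributions, with 0 * log(0 / q) := 0."""
        mask = p > 0
        return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

    prior = rng.dirichlet(np.ones(4))        # p(w)
    lik = rng.dirichlet(np.ones(5), size=4)  # p(y | w)
    bma = prior @ lik                        # arithmetic mixture p(y)
    geo = np.exp(prior @ np.log(lik))        # exponential mixture p̂(y), unnormalized
    geo /= geo.sum()

    def avg_forward_kl(q):
        """E_{p(w')}[KL(p(Y | w') || q(Y))]."""
        return sum(pw * kl(py_w, q) for pw, py_w in zip(prior, lik))

    def avg_reverse_kl(q):
        """E_{p(w')}[KL(q(Y) || p(Y | w'))]."""
        return sum(pw * kl(q, py_w) for pw, py_w in zip(prior, lik))

    mi = avg_forward_kl(bma)          # mutual information I[Y; W]
    rmi = avg_reverse_kl(geo)         # reverse mutual information
    pseudo_rmi = avg_reverse_kl(bma)  # pseudo-reverse mutual information

    # Minimization properties (spot check): the BMA minimizes the average forward KL,
    # the exponential mixture minimizes the average reverse KL.
    for _ in range(1000):
        q = rng.dirichlet(np.ones(5))
        assert avg_forward_kl(q) >= mi and avg_reverse_kl(q) >= rmi

    # Pseudo-RMI = bias + RMI.
    assert np.isclose(pseudo_rmi, kl(bma, geo) + rmi)

    # RMI as a single KL between a product and the joint: KL(p̂(Y) p(W) || p(Y, W)).
    joint = prior[:, None] * lik             # p(w, y)
    product = prior[:, None] * geo[None, :]  # p(w) p̂(y)
    assert np.isclose(rmi, kl(product.ravel(), joint.ravel()))

    # Jensen lower bound: RMI >= KL(p̂(Y) || p(Y)).
    assert rmi >= kl(geo, bma)

    # Support mismatch: zero out one outcome for one likelihood. The pseudo-RMI
    # becomes infinite, while p̂(y) vanishes there and the RMI stays finite.
    lik0 = lik.copy()
    lik0[0, 0] = 0.0
    lik0 /= lik0.sum(axis=1, keepdims=True)
    bma0 = prior @ lik0
    with np.errstate(divide="ignore"):
        geo0 = np.exp(prior @ np.log(lik0))
        geo0 /= geo0.sum()
        rmi0 = sum(pw * kl(geo0, py_w) for pw, py_w in zip(prior, lik0))
        pseudo0 = sum(pw * kl(bma0, py_w) for pw, py_w in zip(prior, lik0))
    assert np.isfinite(rmi0) and np.isinf(pseudo0)
    ```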

  9. Literature Comparison. The doubly expected loss gap is considered as “Expected Pairwise KL-Divergence” in Malinin (2019) 4. The pseudo-reverse mutual information is presented as “reverse mutual information” in Malinin and Gales (2020) 5. Finally, Schweighofer et al. (2023) 6 propose the doubly expected loss gap as a new information-theoretic measure of epistemic uncertainty, drawing on the previously cited works.

All in all, we see that the Bayesian Model Average (BMA) is optimal and coherent, in the sense that if we accept our prior beliefs as the best representation of our uncertainty, the BMA yields the lowest expected loss (gap) given our model of the world. This optimality is demonstrated through information-theoretic arguments, showing that the mutual information between model parameters and predictions forms a lower bound on the average loss gap.

Furthermore, the reverse mutual information introduced above provides additional insight into the expected loss gap between hypotheses and into how the BMA compares to any single hypothesis, and we examined several of its properties along the way. Together, these results frame the BMA as a robust and theoretically grounded approach for a rational actor making predictions under uncertainty.


Acknowledgements: This post was inspired by Pedro Ortega’s excellent blog post “And, Or, and the Two KL Projections”. I also want to thank Pedro Ortega for his feedback and for pointing me to additional literature.

Appendix

This appendix contains a derivation of the exponential mixture and a further decomposition of the reverse mutual information.

Derivation of the Exponential Mixture

To derive the exponential mixture, we start with the expression using a free distribution \(\qof{\w}\) and pull the expectation into the right-hand side: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\qof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad = \simpleE{\qof{\y}}\E{\pof{\w'}}{-\log \pof{\y \given \w'}} - \xHof{\qof{\Y}} \\ &\quad = \E{\qof{\y}}{ -\log \exp \E{\pof{\w'}}{\log \pof{\y \given \w'}}} - \xHof{\qof{\Y}} \\ &\quad = \E{\qof{\y}}{ -\log \left (\frac{1}{Z} \exp \E{\pof{\w'}}{\log \pof{\y \given \w'}} \right )} - \log Z - \xHof{\qof{\Y}} \\ &\quad = \Kale{\qof{\Y}}{\frac{1}{Z} \exp \E{\pof{\w'}}{\log \pof{\y \given \w'}}} - \log Z \\ &\quad = \Kale{\qof{\Y}}{\hpof{\Y}} - \log Z \\ &\quad \ge - \log Z. \end{aligned} \]

Here, \(Z\) is the partition constant that normalizes \(\exp \E{\pof{\w'}}{\log \pof{\y \given \w'}}\) into a probability distribution \(\hpof{\y}\), and we have equality when \(\qof{\y} = \hpof{\y}\).

The following step to determine \(Z\) seems almost too easy. \(Z\) is independent of the distribution \(\qof{\y}\), so we simply observe that the equality case provides us with \(- \log Z\): \[ -\log Z = \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}. \] Plugging this into the result above with \(\pof{\Y}\) for \(\qof{\Y}\), we obtain: \[ \begin{aligned} \E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} &= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}. \end{aligned} \] This provides us with both the intended statements and the minimization result that we use to define the RMI.

Further Decomposition of the Reverse Mutual Information

We can further decompose the reverse mutual information into another (reverse) bias term and a double expectation.

  1. We have already seen that: \[ \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} \ge \Kale{\hpof{\Y}}{\pof{\Y}} \ge 0. \]

  2. We can rewrite this as a decomposition into a bias term and a different kind of mutual information: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad = \Kale{\hpof{\Y}}{\pof{\Y}} \\ &\quad \quad + \left(\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} - \Kale{\hpof{\Y}}{\pof{\Y}}\right) \\ &\quad = \Kale{\hpof{\Y}}{\pof{\Y}} + \E{\pof{\w'}\hpof{\y}}{\Hof{\frac{\pof{\y, \w'}}{\pof{\w'}\,\pof{\y}}}}. \end{aligned} \] The details are straightforward term rewriting and left to the reader. From the lower bound above, we see that both terms are non-negative.

  3. The second term almost looks like a mutual information, but it very much is not: in \[\E{\pof{\w'}\hpof{\y}}{\Hof{\frac{\pof{\y, \w'}}{\pof{\w'}\,\pof{\y}}}},\] the outer expectation is taken over the product \(\pof{\w'}\,\hpof{\y}\) rather than the joint, and the ratio is inverted compared to the mutual information, which in this notation reads \[\MIof{\Y ; \W} = \E{\pof{\y, \w'}}{\Hof{\frac{\pof{\w'}\,\pof{\y}}{\pof{\y, \w'}}}}.\]

  4. Anyhow, this allows us to decompose the doubly expected loss gap into four terms: \[\begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}=\\ &\quad= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} + \MIof{\Y ; \W} \\ &\quad= \Kale{\pof{\Y}}{\hpof{\Y}} + \Kale{\hpof{\Y}}{\pof{\Y}} \\ &\quad \quad + \E{\pof{\w'}\hpof{\y}}{\Hof{\frac{\pof{\y, \w'}}{\pof{\w'}\,\pof{\y}}}} + \MIof{\Y ; \W}. \end{aligned}\] We have two “bias” terms, which are pleasingly symmetric, \[\Kale{\pof{\Y}}{\hpof{\Y}} + \Kale{\hpof{\Y}}{\pof{\Y}},\] the double expectation, which is new, and a mutual information term, all of which are non-negative, as the sketch below confirms numerically.
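
    Here is a minimal numerical sketch of this four-term decomposition on a hypothetical discrete toy example (numpy; all names are mine):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        return np.sum(p * (np.log(p) - np.log(q)))

    prior = rng.dirichlet(np.ones(4))        # p(w)
    lik = rng.dirichlet(np.ones(5), size=4)  # p(y | w)
    bma = prior @ lik                        # p(y)
    geo = np.exp(prior @ np.log(lik))        # p̂(y), unnormalized
    geo /= geo.sum()

    mi = sum(pw * kl(py_w, bma) for pw, py_w in zip(prior, lik))   # I[Y; W]
    rmi = sum(pw * kl(geo, py_w) for pw, py_w in zip(prior, lik))  # reverse MI

    # The "almost but not quite a mutual information" term:
    # E_{p(w') p̂(y)}[-log(p(y, w') / (p(w') p(y)))] = E_{p(w') p̂(y)}[log(p(y) / p(y | w'))].
    not_mi = sum(
        pw * py_hat * np.log(bma[y] / lik[w, y])
        for w, pw in enumerate(prior)
        for y, py_hat in enumerate(geo)
    )

    # RMI = reverse bias + the new term, and both pieces are non-negative.
    assert np.isclose(rmi, kl(geo, bma) + not_mi) and not_mi >= 0.0

    # Four-term decomposition of the doubly expected loss gap.
    double_gap = sum(
        pw1 * pw * kl(py_w, py_w1)
        for pw1, py_w1 in zip(prior, lik)
        for pw, py_w in zip(prior, lik)
    )
    assert np.isclose(double_gap, kl(bma, geo) + kl(geo, bma) + not_mi + mi)
    ```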

Sources:


  1. Heskes, Tom. “Bias/variance decompositions for likelihood-based estimators.” Neural Computation 10.6 (1998): 1425-1433.↩︎

  2. Haussler, David, and Manfred Opper. “Mutual information, metric entropy and cumulative relative entropy risk.” The Annals of Statistics 25.6 (1997): 2451-2492.↩︎

  3. Malinin, Andrey, and Mark Gales. “Uncertainty estimation in autoregressive structured prediction.” arXiv preprint arXiv:2002.07650 (2020).↩︎

  4. Malinin, Andrey. Uncertainty estimation in deep learning with application to spoken language assessment. Diss. 2019.↩︎

  5. Malinin, Andrey, and Mark Gales. “Uncertainty estimation in autoregressive structured prediction.” arXiv preprint arXiv:2002.07650 (2020).↩︎

  6. Schweighofer, Kajetan, et al. “Introducing an improved information-theoretic measure of predictive uncertainty.” arXiv preprint arXiv:2311.08309 (2023).↩︎