Why is the Bayesian model average (BMA) often hailed as the optimal choice for rational actors making predictions under uncertainty? Is this claim justified, and if so, what’s the underlying logic?

No seriously, why? 😂 Please point me to the right references on this.

Below, I enumerate some naive thoughts on this. I will argue in the affirmative and sketch a simple information-theoretic argument.

Concretely, we’ll dive deep into the information-theoretic foundations of BMA optimality. We’ll explore how mutual information between model parameters and predictions forms a lower bound on the average loss gap, and introduce the concept of “reverse mutual information” to gain further insights.

Using a combination of mathematical rigor and intuitive explanations, we’ll examine:

  1. The setting and assumptions behind BMA
  2. The expected loss gap and its implications
  3. The role of mutual information and its reverse counterpart
  4. How BMA compares to other possible hypotheses

(Note that none of this will tell us anything about the true loss gap, which would require knowing the actual data generating process.)

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\hpof}[1]{\hat{\opp}(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\hqof}[1]{\hat{\opq}(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\Dany}{\mathcal{D}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\L}{\boldsymbol{L}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\pdata}[1]{\hpcof{\text{data}}{#1}} \newcommand{\normaldist}[1]{\mathcal{N}(#1)} \newcommand{\sgn}{\operatorname{sgn}} $$
  1. Setting. Let’s assume we have a set of hypotheses \(\W\) and we assign a prior belief \(\pof{\w}\) to each of them. The hypotheses allow us to make predictions \(\Y\) using a likelihood \(\pof{\Y \given \w}\). The Bayesian model average (BMA) is the prediction \(\pof{\y} = \E{\pof{\w}}{\pof{\y \given \w}}\).
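
    A minimal numerical sketch of this setting (all numbers are made up for illustration): a handful of discrete hypotheses over a few outcomes, a prior over them, and the BMA as the prior-weighted average of the likelihoods. Later sketches in this post reuse this kind of toy setup.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy setting: 3 hypotheses w over 4 outcomes y.
    p_w = np.array([0.5, 0.3, 0.2])                  # prior beliefs p(w)
    p_y_given_w = rng.dirichlet(np.ones(4), size=3)  # likelihoods p(y | w), one row per w
    p_y = p_w @ p_y_given_w                          # BMA: p(y) = E_{p(w)}[p(y | w)]

    print(p_y, p_y.sum())  # a proper distribution over the 4 outcomes (sums to 1)
    ```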

  2. Rational Actor. Further, assume that our prior beliefs are rational; that is, if we had reason to believe that they were not consistent with our knowledge of the world so far, we would have chosen different prior beliefs \(\pof{\w}\).

  3. Disjunction. We believe that the world is represented by one of the hypotheses \(\w\), each with probability \(\pof{\w}\).

  4. Expected Loss Gap. We measure the difference between two distributions using the (forward) Kullback-Leibler divergence \(\Kale{\pof{\cdot}}{\qof{\cdot}}\). It is defined as \[\Kale{\pof{\cdot}}{\qof{\cdot}} = \CrossEntropy{\pof{\cdot}}{\qof{\cdot}} - \xHof{\pof{\cdot}} \ge 0,\] where \(\CrossEntropy{\pof{\cdot}}{\qof{\cdot}}\) is the cross-entropy between \(\pof{\cdot}\) and \(\qof{\cdot}\) and \(\xHof{\pof{\cdot}}\) is the entropy of \(\pof{\cdot}\). The divergence is \(0\) if and only if \(\pof{\cdot} = \qof{\cdot}\).

    This tells us how much worse \(\qof{\cdot}\) is compared to \(\pof{\cdot}\) in terms of expected loss assuming that \(\pof{\cdot}\) is the true data-generating distribution. Thus, for a given hypothesis \(\qof{\cdot}\), we are interested in: \[ \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\qof{\Y}}}. \]

    At the same time, if we are interested in the expected loss itself rather than the gap, we would compute the expected cross-entropy between the candidate true distributions \(\pof{\Y \given \w}\) and the hypothesis \(\qof{\Y}\), which reduces to the cross-entropy between the BMA \(\pof{\Y}\) and \(\qof{\Y}\): \[ \begin{aligned} \E{\pof{\w}}{\CrossEntropy{\pof{\Y \given \w}}{\qof{\Y}}} &= \CrossEntropy{\pof{\Y}}{\qof{\Y}} \\ &= \Kale{\pof{\Y}}{\qof{\Y}} + \xHof{\pof{\Y}} \\ &\ge \xHof{\pof{\Y}}. \end{aligned} \]
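
    As a quick numerical check of these identities (a sketch with hypothetical numbers, continuing the toy setup from item 1): the expected cross-entropy against the candidate true models reduces to the cross-entropy against the BMA, which in turn splits into a KL term plus the BMA’s entropy.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup as in item 1 (hypothetical numbers).
    p_w = np.array([0.5, 0.3, 0.2])
    p_y_given_w = rng.dirichlet(np.ones(4), size=3)
    p_y = p_w @ p_y_given_w  # BMA

    def cross_entropy(p, q):
        return -np.sum(p * np.log(q))

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    q_y = rng.dirichlet(np.ones(4))  # some other hypothesis q(y)

    # E_{p(w)}[H(p(Y|w) || q(Y))] = H(p(Y) || q(Y)): cross-entropy is linear in its first argument.
    lhs = sum(pw * cross_entropy(pyw, q_y) for pw, pyw in zip(p_w, p_y_given_w))
    rhs = cross_entropy(p_y, q_y)
    assert np.isclose(lhs, rhs)

    # H(p(Y) || q(Y)) = KL(p(Y) || q(Y)) + H(p(Y)) >= H(p(Y)).
    entropy_p_y = cross_entropy(p_y, p_y)
    assert np.isclose(rhs, kl(p_y, q_y) + entropy_p_y) and rhs >= entropy_p_y
    ```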

  5. Choosing a Prediction. From this, we can already conclude that the BMA \(\pof{\y}\) is the optimal prediction compared to any other possible hypothesis \(\qof{\Y}\). More specifically, we can show that: \[ \begin{aligned} 0 \le \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\qof{\Y}}} &= \E{\pof{\w}}{\CrossEntropy{\pof{\Y \given \w}}{\qof{\Y}}} - \E{\pof{\w}}{\xHof{\pof{\Y \given \w}}} \\ &= \CrossEntropy{\pof{\Y}}{\qof{\Y}} - \xHof{\pof{\Y \given \W}} \\ &= \Kale{\pof{\Y}}{\qof{\Y}} + \xHof{\pof{\Y}} - \xHof{\pof{\Y \given \W}} \\ &= \Kale{\pof{\Y}}{\qof{\Y}} + \MIof{\Y ; \W} \\ &\ge \MIof{\Y ; \W}. \end{aligned} \] From \(\Kale{\pof{\Y}}{\qof{\Y}} \ge 0\), we see that the inequality gap is minimal when \(\qof{\Y} = \pof{\Y}\), in which case the expected BMA loss gap is just the mutual information between the predictions and the model parameters. (The above derivation follows from the definition of the Kullback-Leibler divergence as “cross-entropy - entropy” and the definition of the mutual information \(\MIof{\Y ; \W} = \Hof{\Y} - \Hof{\Y \given \W}\).)
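
    This claim is easy to check numerically. Below is a sketch (same hypothetical toy setup as above) that computes the expected loss gap of the BMA, which equals the mutual information \(\MIof{\Y ; \W}\), and verifies that any other candidate \(\qof{\y}\) incurs exactly the additional penalty \(\Kale{\pof{\Y}}{\qof{\Y}}\).

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Toy setup as above (hypothetical numbers).
    p_w = np.array([0.5, 0.3, 0.2])
    p_y_given_w = rng.dirichlet(np.ones(4), size=3)
    p_y = p_w @ p_y_given_w  # BMA

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    def expected_gap(q_y):
        """E_{p(w)}[KL(p(Y|w) || q(Y))]."""
        return sum(pw * kl(pyw, q_y) for pw, pyw in zip(p_w, p_y_given_w))

    # The BMA achieves the lower bound: its expected gap is the mutual information I(Y; W).
    mi = expected_gap(p_y)

    for _ in range(1000):
        q_y = rng.dirichlet(np.ones(4))
        # Any other hypothesis pays exactly KL(p(Y) || q(Y)) on top of the mutual information.
        assert np.isclose(expected_gap(q_y), kl(p_y, q_y) + mi)
        assert expected_gap(q_y) >= mi - 1e-12
    ```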

  6. Doubly Expected Loss Gap. We can also consider picking the hypothesis according to our prior beliefs \(\pof{\w}\) and measuring the expected loss gap between all pairs of hypotheses, depending on which one turns out to be true. This leads to the following doubly expected loss gap: \[ \simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}, \] which we can similarly simplify to: \[ \begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}=\\ &\quad= \ldots\\ &\quad=\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} + \MIof{\Y ; \W} \\ &\quad\ge \MIof{\Y ; \W} \ge 0, \end{aligned} \] with equality if and only if \(\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} = 0\). We can rewrite this expected KL divergence between the BMA and the individual hypotheses as: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad = \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} \\ \end{aligned} \] with \[ \hpof{\y} \propto \exp(\E{\pof{\w'}}{\log \pof{\y \given \w'}}) \left( = \lim_{n \to \infty} \sqrt[n]{\prod_{\w'_i \sim \pof{\w'}} \pof{\y \given \w'_i}} \right), \] where we can also express “\(\exp \opExpectation \log\)” as a limit of \(\sqrt[n]{\prod \cdot}\) of tempered products to make it more intuitive: while the BMA corresponds to an arithmetic average of individual probabilities, \(\hpof{\y}\) is a geometric average of individual probabilities. See the appendix or Heskes (1998) 1 for a derivation; a short numerical sketch follows at the end of this item.

    1. Equality. This tells us that equality requires \(\hpof{\y} = \pof{\y}\) and \(\hpof{\y} = \pof{\y \given \w'}\) for almost all \(\w'\), so we only have equality when our beliefs have, in essence, collapsed to a single hypothesis; but then \(\MIof{\Y ; \W} = 0\) as well, and the expected loss gap is \(0\). Otherwise, the inequalities are all strict.

    2. Disjunctions and Conjunctions. In other words, \(\hpof{\y}\) is a conjunction of all possible hypotheses (exponential mixture, “AND-projection”) instead of a disjunction of all possible hypotheses like the BMA (linear mixture, “OR-projection”). See Pedro Ortega’s excellent blog post “And, Or, and the Two KL Projections” for intuitions.

    3. Doubly Expected Loss. Similarly, we can show for the doubly expected loss: \[ \begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\CrossEntropy{\pof{\Y \given \w}}{\pof{\Y \given \w'}}} = \\ &\quad = \ldots \\ &\quad = \E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} + \xHof{\pof{\Y}} \\ &\quad = \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} + \Kale{\pof{\Y}}{\hpof{\Y}} + \xHof{\pof{\Y}} \\ &\quad \ge \xHof{\pof{\Y}} \ge 0. \end{aligned} \]
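
    Here is the promised numerical sketch (same hypothetical toy setup as before) of the quantities in this item: the exponential (geometric) mixture \(\hpof{\y}\) and the decomposition of \(\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}}\) into the bias term plus the reverse mutual information.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Toy setup as above (hypothetical numbers).
    p_w = np.array([0.5, 0.3, 0.2])
    p_y_given_w = rng.dirichlet(np.ones(4), size=3)
    p_y = p_w @ p_y_given_w  # BMA: arithmetic (linear) mixture

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    # Exponential (geometric) mixture: hp(y) ∝ exp(E_{p(w')}[log p(y | w')]).
    hp_y = np.exp(p_w @ np.log(p_y_given_w))
    hp_y /= hp_y.sum()

    pseudo_rmi = sum(pw * kl(p_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
    rmi = sum(pw * kl(hp_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
    bias = kl(p_y, hp_y)

    # E_{p(w')}[KL(p(Y) || p(Y|w'))] = KL(p(Y) || hp(Y)) + E_{p(w')}[KL(hp(Y) || p(Y|w'))].
    assert np.isclose(pseudo_rmi, bias + rmi)
    ```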

  7. Interpretation:

    1. This shows that the expected loss gap of the BMA lower-bounds the expected loss gap between any possible true model of the world \(\pof{\y \given \w}\), weighted according to our beliefs, and any other possible hypothesis \(\qof{\cdot}\) we might compare it to. Crucially, this is conditional on the prior beliefs being the best we can do to model the distribution over hypotheses; see (2).

    2. The expected loss when simply using the BMA instead of picking a specific hypothesis is thus always lower.

      In other words, the BMA performs better than any single hypothesis in expectation.

    3. We can characterize the doubly expected loss gap between hypotheses drawn from our prior beliefs using the mutual information, the reverse mutual information \(\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}\), and a bias term \(\Kale{\pof{\Y}}{\hpof{\Y}}\): \[\begin{aligned} &\simpleE{\pof{\w'}}\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y \given \w'}}}=\\ &\quad= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} + \MIof{\Y ; \W}. \end{aligned} \]
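
    The sketch below (same hypothetical toy setup as before) checks this three-term decomposition of the doubly expected loss gap numerically.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # Toy setup as above (hypothetical numbers).
    p_w = np.array([0.5, 0.3, 0.2])
    p_y_given_w = rng.dirichlet(np.ones(4), size=3)
    p_y = p_w @ p_y_given_w                   # linear mixture (BMA)

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    hp_y = np.exp(p_w @ np.log(p_y_given_w))  # exponential (geometric) mixture
    hp_y /= hp_y.sum()

    doubly_expected_gap = sum(
        pw1 * pw2 * kl(pyw1, pyw2)
        for pw1, pyw1 in zip(p_w, p_y_given_w)
        for pw2, pyw2 in zip(p_w, p_y_given_w)
    )
    mi = sum(pw * kl(pyw, p_y) for pw, pyw in zip(p_w, p_y_given_w))
    rmi = sum(pw * kl(hp_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
    bias = kl(p_y, hp_y)

    # Doubly expected loss gap = bias + reverse mutual information + mutual information.
    assert np.isclose(doubly_expected_gap, bias + rmi + mi)
    ```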

  8. Reverse Mutual Information. Reverse mutual information is a good name for \[\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}\] since the mutual information can be written as \[\E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y}}},\] and the reverse mutual information swaps the linear mixture \(\pof{\y}\) for the exponential mixture \(\hpof{\y}\) and the forward KL divergence for the reverse KL divergence. We can further motivate this:

    1. (R)MI as a Minimization Problem. To clarify these terms further, we can express the mutual information as a minimization problem using the forward KL divergence [Pedro points to Haussler and Opper (1997)2 as an early reference for the following]: \[ \MIof{\Y ; \W} = \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{\pof{\Y}}} = \min_{q(\y)} \E{\pof{\w}}{\Kale{\pof{\Y \given \w}}{q(\y)}}, \] and similarly the reverse mutual information substitutes the reverse KL divergence instead of the forward KL divergence: \[ \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} = \min_{q(\y)} \E{\pof{\w'}}{\Kale{\qof{\Y}}{\pof{\Y \given \w'}}}. \]

    2. Pseudo-RMI. It is instructive to consider a different term, the “pseudo-reverse mutual information” \[\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}},\] and ask why it might not be a good contender for a “reverse mutual information”. We have already shown that it decomposes into a bias term plus the reverse mutual information: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} = \\ &\quad= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}. \end{aligned} \] Moreover, we have already presented one argument: both the mutual information and the reverse mutual information arise from analogous minimization problems, which is not the case for the pseudo-reverse mutual information.

    3. Finite RMI. Additionally, the pseudo-reverse mutual information is not always guaranteed to be finite, while both the regular mutual information and the reverse mutual information are. Let’s examine this; a small numerical example follows at the end of this item.

      The KL divergence \(\Kale{\pof{\cdot}}{\qof{\cdot}}\) is finite when \(\qof{\cdot}\) has support (\(>0\)) wherever \(\pof{\cdot}\) does. Otherwise, it is infinite. We can easily construct a case where \(\pof{\y \given \w} = 0\) but \(\pof{\y} > 0\) for some \(\y\) and \(\w\). Then \(\E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} = \infty\).

      On the other hand, the RMI is always finite: intuitively, if \(\pof{\y \given \w} = 0\) for some \(\w\), then \(\hpof{\y} = 0\) as well, because a geometric average that contains a single \(0\) is \(0\) overall.

      Formally, the minimization perspective tells us that the RMI is upper-bounded by plugging in any other distribution: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} = \\ &\quad= \min_{q(\y)} \E{\pof{\w'}}{\Kale{\qof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad \le \E{\pof{\w'}}{\Kale{U(\Y)}{\pof{\Y \given \w'}}}, \end{aligned} \] where we choose \(U(\cdot)\) to be the uniform distribution for bounded \(Y\) or a Gaussian for unbounded \(Y\); either has full support, and thus the RMI is always finite.

    4. Another Perspective. We can further rewrite the RMI as: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} = \\ &\quad = \E{\hpof{\y}\,\pof{\w'}}{\xHof{\frac{\pof{\y \given \w'}}{\hpof{\y}}}} \\ &\quad = \E{\hpof{\y}\,\pof{\w'}}{\xHof{\frac{\pof{\y, \w'}}{\hpof{\y}\,\pof{\w'}}}} \\ &\quad = \Kale{\hpof{\Y}\,\pof{\W'}}{\pof{\Y, \W'}}, \end{aligned} \] which is again very similar to the mutual information, which can be written as: \[ \begin{aligned} \MIof{\Y ; \W} = \Kale{\pof{\Y, \W}}{\pof{\Y} \, \pof{\W}}. \end{aligned} \]

    5. Simple Lower Bound. Finally, using Jensen’s inequality (the KL divergence is convex in its second argument), we find a lower bound on the RMI: \[ \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}} \ge \Kale{\hpof{\Y}}{\pof{\Y}} \ge 0. \]
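
    To make the finiteness argument concrete, here is the promised small example (purely hypothetical numbers) in which one hypothesis puts zero mass on an outcome: the pseudo-reverse mutual information blows up, while the geometric mixture inherits the zero and the RMI stays finite.

    ```python
    import numpy as np

    # Hypothetical example: hypothesis w=0 puts zero mass on outcome y=2,
    # so the BMA p(y) has support there while p(y | w=0) does not.
    p_w = np.array([0.5, 0.5])
    p_y_given_w = np.array([
        [0.5, 0.5, 0.0],  # p(y | w=0): no support on y=2
        [0.2, 0.3, 0.5],  # p(y | w=1): full support
    ])
    p_y = p_w @ p_y_given_w

    def kl(p, q):
        # Convention: 0 * log(0 / q) = 0; p > 0 where q = 0 gives +inf.
        mask = p > 0
        with np.errstate(divide="ignore"):
            return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

    # Pseudo-RMI is infinite because KL(p(Y) || p(Y | w=0)) = inf.
    pseudo_rmi = sum(pw * kl(p_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
    print(pseudo_rmi)  # inf

    # The geometric mixture hp(y) inherits the zero, so the RMI stays finite.
    with np.errstate(divide="ignore"):
        hp_y = np.exp(p_w @ np.log(p_y_given_w))  # exp(-inf) = 0 where any p(y|w) = 0
    hp_y /= hp_y.sum()
    rmi = sum(pw * kl(hp_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
    print(rmi)  # finite
    ```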

  9. Literature Comparison. The doubly expected loss gap is considered as “Expected Pairwise KL-Divergence” in Malinin (2019) 3. The pseudo-reverse mutual information is presented as “reverse mutual information” in Malinin and Gales (2020) 4. Finally, Schweighofer et al. (2023) 5 propose the doubly expected loss gap as a new information-theoretic measure of epistemic uncertainty, drawing on the previously cited works.

All in all, we see that the Bayesian Model Average (BMA) is optimal and coherent, in the sense that if we accept our prior beliefs as the best representation of our uncertainty, the BMA yields the lowest expected loss (gap) given our model of the world. This optimality is demonstrated through information-theoretic arguments, showing that the mutual information between model parameters and predictions forms a lower bound on the average loss gap. Furthermore, the introduction of the reverse mutual information provides additional insights into the expected loss gap between hypotheses, offering a more comprehensive understanding of the BMA’s performance in comparison to other possible hypotheses. These findings underscore the BMA’s role as a robust and theoretically grounded approach for making predictions as rational actors in the face of uncertainty. At the same time, we also examined properties of the proposed reverse mutual information.


Acknowledgements: This post was inspired by Pedro Ortega’s excellent blog post “And, Or, and the Two KL Projections”. I also want to thank Pedro Ortega for his feedback and for pointing me to additional literature.

Appendix

Derivation of the Exponential Mixture

To derive the exponential mixture, we start with the expression for a free distribution \(\qof{\y}\) and push the expectation over \(\w'\) inside: \[ \begin{aligned} &\E{\pof{\w'}}{\Kale{\qof{\Y}}{\pof{\Y \given \w'}}} \\ &\quad = \simpleE{\qof{\y}}\E{\pof{\w'}}{-\log \pof{\y \given \w'}} - \xHof{\qof{\Y}} \\ &\quad = \E{\qof{\y}}{ -\log \exp \E{\pof{\w'}}{\log \pof{\y \given \w'}}} - \xHof{\qof{\Y}} \\ &\quad = \E{\qof{\y}}{ -\log \left (\frac{1}{Z} \exp \E{\pof{\w'}}{\log \pof{\y \given \w'}} \right )} - \log Z - \xHof{\qof{\Y}} \\ &\quad = \Kale{\qof{\Y}}{\frac{1}{Z} \exp \E{\pof{\w'}}{\log \pof{\y \given \w'}}} - \log Z \\ &\quad = \Kale{\qof{\Y}}{\hpof{\Y}} - \log Z \\ &\quad \ge - \log Z. \end{aligned} \]

where \(Z\) is the normalization constant that turns \(\exp \E{\pof{\w'}}{\log \pof{\y \given \w'}}\) into a probability distribution \(\hpof{\y}\), and we have equality if and only if \(\qof{\y} = \hpof{\y}\).

The following step to determine \(Z\) seems almost too easy. \(Z\) is independent of the distribution \(\qof{\y}\), so we simply observe that the equality case provides us with \(- \log Z\): \[ -\log Z = \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}. \] Plugging this into the result above with \(\pof{\Y}\) for \(\qof{\Y}\), we obtain: \[ \begin{aligned} \E{\pof{\w'}}{\Kale{\pof{\Y}}{\pof{\Y \given \w'}}} &= \Kale{\pof{\Y}}{\hpof{\Y}} + \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}. \end{aligned} \] This provides us both with the intended statement and with the minimization result that we use to define the RMI.
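
To see the derivation in action, here is a small numerical sketch (hypothetical toy numbers, as in the main text) that checks both \(-\log Z = \E{\pof{\w'}}{\Kale{\hpof{\Y}}{\pof{\Y \given \w'}}}\) and the resulting decomposition.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy discrete setup (hypothetical numbers).
p_w = np.array([0.5, 0.3, 0.2])
p_y_given_w = rng.dirichlet(np.ones(4), size=3)
p_y = p_w @ p_y_given_w

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Unnormalized exponential mixture and its normalization constant Z.
unnorm = np.exp(p_w @ np.log(p_y_given_w))
Z = unnorm.sum()
hp_y = unnorm / Z

# Equality case: -log Z = E_{p(w')}[KL(hp(Y) || p(Y | w'))], i.e. the RMI.
rmi = sum(pw * kl(hp_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
assert np.isclose(-np.log(Z), rmi)

# Plugging in q(Y) = p(Y): E_{p(w')}[KL(p(Y) || p(Y|w'))] = KL(p(Y) || hp(Y)) + RMI.
pseudo_rmi = sum(pw * kl(p_y, pyw) for pw, pyw in zip(p_w, p_y_given_w))
assert np.isclose(pseudo_rmi, kl(p_y, hp_y) + rmi)
```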

Sources: