In this blog post, following the previous blog post on “Bayesian Appropriation: General Likelihood for Loss Functions”, we will examine and better understand parts of the paper “PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks” (“PACTran”), which was presented as an oral at ECCV 2022. Then, we will take a look at the objective in the follow-up work “Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization” (“IRG”), which was presented as a highlight at CVPR 2023.

Both papers are very interesting and I hope that this post will help readers better understand the papers, and more generally, highlight the equivalence between variational inference and optimizing a PAC-Bayes bound.

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\omega}} \newcommand{\W}{\boldsymbol{\Omega}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} $$

Background

In the previous blog post on “Bayesian Appropriation: General Likelihood for Loss Functions”, we discussed how we can rephrase certain objectives using variational inference with a general likelihood for loss functions in a Bayesian setting. This post builds on that foundation to explore the connection between variational inference and optimizing PAC-Bayes bounds from Germain et al., 2015.

In the following two subsections, copied from the previous blog post, we introduce the information-theoretic notation we are going to use, and variational inference.

As always, feel free to skim and skip as is best for you :-)

Information-Theoretic Notation

Information theory deals with the quantification, storage, and communication of information. In this blog post, we use information-theoretic notation to express various quantities related to probability distributions and their relationships. Some key quantities we will use include the information content of an event \(x\), the entropy of a random variable \(X\), and the Kullback-Leibler Divergence (🥬) between two probability distributions:

The information content of an event \(x\) is \(-\log \pof{x}\): \[\Hof{x} \triangleq \xHof{\pof{x}} \triangleq -\log \pof{x}.\] We commonly use this information content as a minimization objective in deep learning (as the negative log-likelihood, which is just the cross-entropy).

The entropy is the expectation over the information content: \[\Hof{X} \triangleq \E{\pof{x}}{\Hof{x}} = \E{\pof{x}}{\xICof{\pof{x}}} = \E{\pof{x}}{-\log \pof{x}},\] where we use the following convention (introduced in “A Practical & Unified Notation for Information-Theoretic Quantities in ML” (Kirsch et al., 2021)): when we have an expression with (upper-case) random variables and (lower-case) observations, we take an expectation over all the random variables conditional on all the given observations. For example, the joint entropy \(\Hof{X, y \given Z}\) is: \[\Hof{X, y \given Z}= \E{\pof{x,z \given y}}{\Hof{x,y \given z}}.\]

In other words, the entropy measures the average amount of information needed to describe a random variable.

We will also use the Kullback-Leibler Divergence (“KL”/🥬) \(\Kale{\pof{X}}{\qof{X}}\) and the cross-entropy \(\CrossEntropy{\pof{X}}{\qof{X}}\): \[\begin{align} \CrossEntropy{\pof{X}}{\qof{X}} &\triangleq \E{\pof{x}}{\xICof{\qof{x}}}\\ \Kale{\pof{X}}{\qof{X}} &\triangleq \CrossEntropy{\pof{X}}{\qof{X}} - \xHof{\pof{X}}\\ &= \E{\pof{x}}{\xICof{\qof{x}}} - \E{\pof{x}}{\xICof{\pof{x}}}\\ &= \E{\pof{x}}{-\log{\qof{x}}} - \E{\pof{x}}{-\log{\pof{x}}}. \end{align}\]
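
To make these definitions concrete, here is a minimal sketch for discrete distributions, using only the Python standard library. The distributions `p` and `q` are made-up examples, and everything is measured in nats.

```python
import math

def entropy(p):
    """H(p(X)) = E_p[-log p(x)] in nats; outcomes with p(x) = 0 contribute 0."""
    return sum(-px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p(X) || q(X)) = E_p[-log q(x)]; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(-px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p(X) || q(X)), computed as cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.25, 0.25]   # hypothetical distribution over three outcomes
q = [0.4, 0.4, 0.2]     # hypothetical second distribution

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

The 🥬 is non-negative and vanishes when both arguments are the same distribution, which is the property the variational inference section relies on.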

Variational Inference

This blog post shows that an optimization objective based on minimizing a PAC-Bayes bound is equivalent to variational inference from Bayesian statistics.

Variational inference is a technique used in Bayesian statistics to approximate complex probability distributions, such as the posterior distribution of model parameters given observed data. The main idea behind variational inference is to find an approximate distribution that is close to the true distribution by minimizing a divergence measure (usually the Kullback-Leibler divergence) between the two distributions. This is particularly useful when the true posterior distribution is difficult or computationally expensive to compute directly (or even infeasible to compute).

We can side-step computing the actual posterior by using Bayes’ law to express it via a prior distribution and a likelihood. Let’s assume we have model parameters \(\w\) and an observation \(x\), with a prior distribution \(\pof{\w}\) and a likelihood \(\pof{x \given \w}\).

We are interested in performing Bayesian inference and learning about \(\pof{\w \given x}\). We can do so by using a different parameterized distribution \(\qcof{\psi}{\w}\) and optimizing \(\psi\) to make \(\qcof{\psi}{\w}\) be as close to \(\pof{\w \given x}\) as possible.

We will use a functional approach for now, which is to say, we ignore \(\psi\) and use \(\qof{\w}\).

Then, we are interested in: \[ \arg\min_{\opq} \, \Kale{\qof{\W}}{\pof{\W \given x}} \] A divergence (and the 🥬 in particular) is 0 iff \(\qof{\w} = \pof{\w \given x},\) that is, iff the two distributions match.

For variational inference, we apply Bayes’ law to obtain: \[ \begin{aligned} &\Kale{\qof{\W}}{\pof{\W \given x}}=\Kale{\qof{\W}}{\frac{\pof{x \given \W}\pof{\W}}{\pof{x}}}\\ &\quad =\CrossEntropy{\qof{\W}}{\pof{x \given \W}} + \CrossEntropy{\qof{\W}}{\pof{\W}} \\ &\quad \quad - \xHof{\pof{x}} - \xHof{\qof{\W}}\\ &\quad = \CrossEntropy{\qof{\W}}{\pof{x \given \W}} + \Kale{\qof{\W}}{\pof{\W}} - \xHof{\pof{x}}. \end{aligned} \] This leads directly to the ELBO (though it is an upper-bound in information-theory): \[ \begin{align} &\Kale{\qof{\W}}{\pof{\W \given x}} \ge 0 \\ \iff & \CrossEntropy{\qof{\W}}{\pof{x \given \W}} + \Kale{\qof{\W}}{\pof{\W}} \ge \xHof{\pof{x}} \\ \iff & \E{\qof{\w}} {-\log \pof{x \given \w}} +\Kale{\qof{\W}}{\pof{\W}} \ge - \log \pof{x}, \end{align} \] where:

  • \(\Kale{\qof{\W}}{\pof{\W}}\) is the prior term (regularizer),
  • \(\E{\qof{\w}} {-\log \pof{x \given \w}}\) is the (marginal) loss, and
  • \(-\log \pof{x}\) is the information content of the evidence (the negative log evidence).

Negating both sides gives the usual form: the ELBO (Evidence Lower Bound) is a lower bound on the log marginal likelihood \(\log \pof{x}\), and maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate and true posterior distributions.
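
The bound can be checked numerically on a toy problem where the posterior is tractable. The following is a sketch under made-up assumptions: a coin whose bias \(\w\) is restricted to three hypothetical values, a uniform prior, and an observation of 7 heads and 3 tails.

```python
import math

# Hypothetical toy model: omega is the bias of a coin, restricted to three values.
omegas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]      # p(omega)
heads, tails = 7, 3                # the observation x

def neg_log_lik(omega):
    """-log p(x | omega) for the observed sequence of coin flips."""
    return -(heads * math.log(omega) + tails * math.log(1 - omega))

# Exact evidence and posterior (tractable here because omega is discrete).
joint = [pw * math.exp(-neg_log_lik(w)) for w, pw in zip(omegas, prior)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def vi_objective(q):
    """E_q[-log p(x | omega)] + KL(q || prior), an upper bound on -log p(x)."""
    expected_loss = sum(qi * neg_log_lik(w) for qi, w in zip(q, omegas))
    return expected_loss + kl(q, prior)

q_guess = [0.1, 0.3, 0.6]          # an arbitrary approximate posterior
print(vi_objective(q_guess), vi_objective(posterior), -math.log(evidence))
```

For any `q`, the gap between `vi_objective(q)` and `-math.log(evidence)` is exactly `kl(q, posterior)`, so the bound is tight only when `q` equals the true posterior.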

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks

The paper “PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks” introduces a new metric for estimating the transferability of pretrained models to classification tasks using PAC-Bayesian theory.

Transfer learning is an important area of research in machine learning, as it allows models to leverage knowledge learned from one task to improve performance on a different, but related task. This can significantly reduce the amount of training data and computational resources required to train high-performing models. However, selecting the most suitable pretrained model for a given task is a challenging problem, as it requires a good measure for comparing the transferability of different models. The PACTran metric and the IRG objective address this challenge by providing a theoretically grounded metric based on PAC-Bayesian theory and variational inference.

The following is the relevant excerpt from the PACTran paper:


Theorem 1. [19] Given a data distribution D, a hypothesis space \(H\), a prior \(P\), a confidence level \(\delta \in(0,1]\), and \(\lambda>0\), with probability at least \(1-\delta\) over samples \(S \sim D^N\), for all posterior \(Q\) \[ L(Q, D) \leq \hat{L}(Q, S)+\frac{1}{\lambda} D_{K L}(Q \| P)+C(\delta, \lambda, N) \] where \(C(\delta, \lambda, N)\) is a constant independent of the posterior \(Q\).

The hyperparameter \(\lambda\) can be adjusted to balance between the divergence and the constant \(C\) terms, where a common choice is \(\lambda \propto N\) (see \([19,45,14]\)). In the transfer learning setting, starting from a pretrained checkpoint \(M\) that is encoded within the prior \(P(h)\), \(L(Q, D)\) measures the generalization error of a posterior \(Q\) after it was finetuned over the downstream data \(S\). Furthermore, by minimizing the RHS of the bound ([above]) with respect to \(Q \in \mathcal{Q}_M\), one can obtain a posterior \(Q\) that has low transfer error \(L(Q, D)\). Therefore, to measure the transferability of a pretrained checkpoint \(M\), we define a family of metrics PACTran by optimizing the PAC-Bayesian bound (ignoring the constant \(C\) since it is the same for all checkpoints): \[\min _{Q \in \mathcal{Q}_M} \hat{L}(Q, S)+\frac{1}{\lambda} D_{K L}(Q \| P)\]

Here, \(L\) uses the negative log-likelihood loss function, \(\hat{L}\) is the empirical loss, and \(\mathcal{Q}_M\) is the family of approximate posteriors that we can choose from.


From the previous blog post (and/or the review of variational inference above), we immediately see that optimizing this PAC-Bayesian bound is equivalent to variational inference, as the objective is the same after dropping the constant \(C\).

This comes as no surprise as this is what “PAC-Bayesian Theory Meets Bayesian Inference” [19] is based on. Its abstract summarizes this:

That is, for the negative log-likelihood loss function, we show that the minimization of PAC-Bayesian generalization risk bounds maximizes the Bayesian marginal likelihood. This provides an alternative explanation to the Bayesian Occam’s razor criteria, under the assumption that the data is generated by an i.i.d distribution.

Comparing the objective to the variational inference objective, we see that the PACTran metric is an upper bound on \(\frac{1}{\lambda} \xHof{\pof{S}} = -\frac{1}{\lambda} \log \pof{S},\) the (scaled) negative log marginal likelihood for the samples \(S\).
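
As a sanity check of this bound, consider a finite hypothesis space, where the minimizer over all posteriors is known in closed form: the Gibbs posterior \(Q^*(h) \propto P(h) \exp(-\lambda \hat{L}(h, S))\). The sketch below uses made-up priors and losses and is not the procedure from the PACTran paper, which optimizes over a restricted family \(\mathcal{Q}_M\). If \(\hat{L}\) is the average negative log-likelihood over \(N\) samples and \(\lambda = N\), then \(\exp(-\lambda \hat{L}(h, S))\) is the likelihood of \(S\) under \(h\), and the normalizer is the marginal likelihood of \(S\).

```python
import math

# Hypothetical finite hypothesis space with prior P(h) and empirical losses L_hat(h, S).
prior = [0.5, 0.3, 0.2]
emp_loss = [0.9, 0.4, 0.6]      # e.g. average negative log-likelihood per hypothesis
lam = 50.0                      # lambda, e.g. the number of samples N

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def pac_bayes_objective(q):
    """L_hat(Q, S) + (1 / lambda) * KL(Q || P), with the constant C dropped."""
    return sum(qi * li for qi, li in zip(q, emp_loss)) + kl(q, prior) / lam

# Gibbs posterior: Q*(h) proportional to P(h) * exp(-lambda * L_hat(h, S)).
weights = [p * math.exp(-lam * l) for p, l in zip(prior, emp_loss)]
normalizer = sum(weights)
gibbs = [w / normalizer for w in weights]

# Scaled negative log evidence under the likelihood exp(-lambda * L_hat).
neg_log_evidence = -math.log(normalizer) / lam

print(pac_bayes_objective(gibbs), neg_log_evidence)
```

The two printed values agree, and any other choice of `q` gives a larger objective, which mirrors the statement that the metric upper-bounds the scaled negative log marginal likelihood.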

Thus, the PACTran paper implicitly provides further evidence that the Bayesian marginal likelihood is indeed a good measure for model selection—because this is what the paper does: selecting the best model (checkpoint) for transfer learning. For transfer learning, this provides a theoretically grounded metric for comparing pretrained models and selecting the most suitable one for a given classification task using well-known principles from Bayesian model selection.

If the computed PACTran metric is not a good approximation for the log evidence, what else could explain the good performance of the PACTran metric compared to its baselines?

On the one hand, we have seen that the PACTran objective is the same as the variational inference objective. On the other hand, the paper evaluates the quality of the metrics by comparing the rank correlation to fine-tuned models:


B.3 Finetuning

Each pretrained checkpoint was finetuned on each downstream task in the following two ways: 8 attempts were made by full-model finetuning of the checkpoints with batch size 512, weight-decay 0.0001, with an SGD-Momentum optimizer using a decaying learning schedule with different starting learning rate \(lr\) and stopping iterations \(iter\): \(lr \in\{0.1,0.05,0.01,0.005\}\) and \(iter \in\{10000,5000\}\); 5 attempts were done with top-layer-only finetuning using an L-BFGS solver with weight decay \(\frac{1}{B} \cdot\{0.01,0.1,1,10,100\}\), where \(B\) is the size of the training set. The ground-truth testing error was set to the lowest test error among all runs.

Thus, I wonder how much this metric is essentially measuring the very thing it is evaluated against. For one, assuming a wide-enough prior, the variational inference objective might be dominated by the empirical loss. Furthermore, wide posteriors are assumed to be more robust, and thus to have lower loss on the test set (see also the IRG paper below). Finally, the weight decay used in the finetuning for evaluation is similar to L2 regularization, which in turn corresponds to a Gaussian prior. The PACTran metric therefore fits the downstream evaluation very well. The PACTran paper only uses a linear probe for estimating downstream performance, so my expectation is that the PACTran metric deviates from the full fine-tuning performance only when the latter changes the embeddings significantly.
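
The link between weight decay and a Gaussian prior can be made concrete with the closed-form KL divergence between isotropic Gaussians. The sketch below assumes a prior centered at zero and a posterior with the same standard deviation as the prior; the weights and `sigma0` are made-up values. Under these assumptions the KL term reduces to an L2 penalty on the posterior mean.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I)), summed over dimensions."""
    total = 0.0
    for mq, mp in zip(mu_q, mu_p):
        total += (math.log(sigma_p / sigma_q)
                  + (sigma_q ** 2 + (mq - mp) ** 2) / (2 * sigma_p ** 2)
                  - 0.5)
    return total

theta = [0.3, -1.2, 0.7]        # hypothetical fine-tuned weights (posterior mean)
sigma0 = 2.0                    # shared standard deviation of prior and posterior
zero = [0.0] * len(theta)       # assumption: the prior is centered at zero

kl_value = kl_diag_gaussians(theta, sigma0, zero, sigma0)
l2_penalty = sum(t ** 2 for t in theta) / (2 * sigma0 ** 2)
print(kl_value, l2_penalty)
```

Both values coincide, so with a fixed posterior width the regularizer \(\frac{1}{\lambda} D_{KL}(Q \| P)\) acts like weight decay with coefficient \(\frac{1}{2 \lambda \sigma_0^2}\).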

Indeed, the follow-up paper, which we look at next, directly uses the PAC-Bayesian bound as an optimization objective to train models (which again is equivalent to variational inference).

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

The follow-up paper “Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization” uses the same PAC-Bayesian bound as the PACTran paper, but this time directly as an optimization objective to train robust models.

The following are the relevant excerpts from the paper:


Theorem 2 (Linear-Form PAC-Bayesian Bound [17]). Under the same assumptions as in Theorem 1, for any \(\beta>0\), with a probability at least \(1-\tau\), \[ \begin{aligned} \mathbb{E}_{\theta \in \mathcal{Q}} R(\theta, \mathcal{D}) \leq & \mathbb{E}_{\theta \in \mathcal{Q}} \hat{R}\left(\theta, D^m\right) \\ & +\frac{1}{\beta} \mathrm{KL}(\mathcal{Q} \| \mathcal{P})+C(\tau, \beta, m) \end{aligned} \tag{4} \] where \(C(\tau, \beta, m)\) is a function independent of \(\mathcal{Q}\).

Here \(\beta\) is a hyper-parameter that balances the three terms on the RHS. A common choice is \(\beta \propto m\), i.e. the size of the dataset [13, 17]. Recently, [12] proposed to set \(\mathcal{P}\) and \(\mathcal{Q}\) to univariate Gaussians and used a second-order approximation to estimate the PAC-Bayesian bound. Applying a similar technique, we introduce Theorem 3 (see Appendix. A for the proof) which minimizes the RHS of Eq. 4.

Theorem 3 (Minimization of PAC-Bayesian Bound). If \(\mathcal{P}=\mathcal{N}\left(\mathbf{0}, \sigma_0^2\right)\), and \(\mathcal{Q}\) is also a product of univariate Gaussian distributions, then the minimum of Eq. 4 w.r.t \(\mathcal{Q}\) can be bounded by \[ \begin{aligned} & \min _{\mathcal{Q}} \mathbb{E}_{\theta \in \mathcal{Q}} R(\theta, \mathcal{D}) \\ \leq & \min _{\mathcal{Q}}\left\{\mathbb{E}_{\theta \in \mathcal{Q}} \hat{R}\left(\theta, D^m\right)+\frac{1}{\beta} \mathrm{KL}(\mathcal{Q} \| \mathcal{P})\right\}+C(\tau, \beta, m) \\ = & \min _\theta\left\{\hat{R}\left(\theta, D^m\right)+\frac{\|\theta\|_2^2}{2 \beta \sigma_0^2}+\frac{\sigma_0^2}{2} \operatorname{Tr}\left(\nabla_\theta^2 \hat{R}\left(\theta, D^m\right)\right)\right\} \\ & +C(\tau, \beta, m)+O\left(\sigma_0^4\right). \end{aligned} \]

[12] is the PACTran paper we have examined above. \(R\) is the robust risk, \(\hat{R}\) is the empirical robust risk, and \(\mathcal{Q}\) is the family of approximate posteriors that we can choose from.


Similarly to PACTran, we see that this objective is the same as a variational inference objective after dropping the constant \(C\).
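
The trace-of-Hessian term in Theorem 3 comes from a second-order expansion of the expected risk under a Gaussian posterior. The following one-dimensional sketch illustrates that step; it is not the paper's proof. The logistic loss stands in for the empirical risk, the posterior is assumed to be \(\mathcal{N}(\mu, \sigma^2)\), and the expectation is computed by numerical integration.

```python
import math

def risk(theta):
    """Hypothetical smooth 1-D empirical risk: logistic loss of a single example."""
    return math.log1p(math.exp(-theta))

def risk_second_derivative(theta):
    """Second derivative of the logistic loss: s * (1 - s) with s = sigmoid(theta)."""
    s = 1.0 / (1.0 + math.exp(-theta))
    return s * (1.0 - s)

def expected_risk(mu, sigma, n=4001, width=8.0):
    """E_{theta ~ N(mu, sigma^2)}[risk(theta)] via the trapezoidal rule over z."""
    h = 2 * width / (n - 1)
    total = 0.0
    for i in range(n):
        z = -width + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        total += weight * density * risk(mu + sigma * z)
    return total * h

def second_order_approx(mu, sigma):
    """risk(mu) + sigma^2 / 2 * trace of the Hessian (a scalar in 1-D)."""
    return risk(mu) + 0.5 * sigma ** 2 * risk_second_derivative(mu)

mu = 0.5
gaps = {}
for sigma in (0.4, 0.2, 0.1):
    gaps[sigma] = abs(expected_risk(mu, sigma) - second_order_approx(mu, sigma))
    print(sigma, gaps[sigma])
```

The gap shrinks quickly as `sigma` decreases, consistent with a remainder of order \(\sigma^4\) as stated in the theorem.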

The robust losses used in the IRG objective are designed to improve the model’s generalization performance using adversarially perturbed examples. Common robust losses use adversarially perturbed input samples or are based on a combination of the standard cross-entropy loss and an additional term that penalizes the model for being sensitive to small input perturbations. Please see the paper for more details. By optimizing such robust losses, the IRG objective can be seen as performing “robust” variational inference with the respective general likelihood based on the corresponding robust loss, as derived in the previous blog post.

Conclusion

Looking back, we have interpreted the variational inference objective and equivalently, the optimization of the PAC-Bayesian bound in two different ways:

  • when we care about the value of the optimization objective, we can view it as maximizing the ELBO, which is a lower bound of the log evidence (Bayesian marginal likelihood), or equivalently as minimizing an upper bound on the information content of the evidence (\(-\log \pof{x}\)), with the aforementioned caveat that the approximation might be lacking; and
  • when we care about the model itself, we can see the objective as minimizing the KL divergence between the approximate posterior and the true posterior to be as close to the Bayesian posterior distribution as possible.

In summary, this blog post considers the connection between the PACTran metric, the IRG objective and variational inference by showing that optimizing the respective PAC-Bayes bounds introduced by Germain et al., 2015 is equivalent to variational inference.


Acknowledgements. Thanks to Nan Ding for providing helpful feedback on the draft.