When training deep ensembles or BNNs, we should be able to maximize the BALD score (a model-disagreement metric) on unlabeled data as a regularizer. This should improve model diversity and active learning efficiency (or OOD detection) where it matters: on the pool or evaluation set data. Given the limited capacity of neural networks, this should help focus a model’s capacity where it is needed most. This research idea was brainstormed together with Polina Kirichenko in late 2021, but I did not get to follow up with experiments as intended. In the meantime, the beautiful “Diversify and Disambiguate: Learning From Underspecified Data” by Lee et al. [1] has been shared, and this research idea builds on top of that.

If you like this idea and want to run experiments/write a paper/make it your own, let us know/add us as co-authors, please (or mention us in the acknowledgments) :hugs:

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \newcommand{\w}{\boldsymbol{\omega}} \newcommand{\W}{\boldsymbol{\Omega}} \newcommand{\y}{y} \newcommand{\Y}{Y} \newcommand{\x}{\boldsymbol{x}} \newcommand{\X}{\boldsymbol{X}} $$

Background

Active Learning depends on identifying the most informative points to label. Specifically, active learning consists of selecting the most informative points from an unlabeled pool set using a model trained on the already labeled training set; then, the newly labeled points are added to the training set, and the model is updated. We repeat these steps until we are happy with the performance of the model.
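The loop described above can be sketched in a few lines of plain Python. This is only an illustration: `train`, `score`, and `oracle` are hypothetical callables standing in for model fitting, an acquisition function such as BALD, and the labeling step, and the "until we are happy" stopping rule is replaced by a fixed number of rounds.

```python
def active_learning_loop(train, score, oracle, labeled, pool, rounds, batch_size):
    """Generic pool-based active learning loop (all callables are hypothetical)."""
    labeled = list(labeled)
    pool = list(pool)
    model = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Rank the pool by informativeness under the current model.
        ranked = sorted(pool, key=lambda x: score(model, x), reverse=True)
        chosen, pool = ranked[:batch_size], ranked[batch_size:]
        # Acquire labels for the chosen points and update the model.
        labeled.extend((x, oracle(x)) for x in chosen)
        model = train(labeled)
    return model, labeled, pool


# Toy stand-ins: the "model" is just the list of labeled inputs, and a point
# is scored by its distance to the nearest labeled input.
toy_train = lambda data: [x for x, _ in data]
toy_score = lambda model, x: min(abs(x - z) for z in model)
toy_oracle = lambda x: x > 0

model, labeled, pool = active_learning_loop(
    toy_train, toy_score, toy_oracle,
    labeled=[(0.0, False)], pool=[-2.0, 0.1, 3.0, 1.0],
    rounds=2, batch_size=1,
)
print(labeled, pool)
```

In the toy run, the points farthest from the labeled data are acquired first, which is the behavior an uncertainty-based score is meant to approximate.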

From an information-theoretic point of view, this means finding the unlabeled points in the pool set with the highest expected information gain, which is also referred to as the BALD score [2] when using Bayesian neural networks and is often said to capture the epistemic uncertainty of the model for a given point. When using neural networks, in practice, the BALD score measures the disagreement between (Monte-Carlo) parameter samples for a given \(\x\), similar to Query by Committee [3]. To be more specific, the BALD score looks as follows, where \(\qof{\w}\) is an empirical distribution of the parameters of the ensemble members, or an approximate parameter distribution, e.g. using Monte-Carlo dropout: \[ BALD=\E{\qof{\w}}{\Kale{\qof{\Y \given \x, \w}}{\qof{\Y \given \x}}}=\varMIof{\opq}{\W;\Y\given \x}, \] where we use the marginal distribution \(\qof{\y\given \x} = \E{\qof{\w}}{\qof{\y\given \x, \w}}\) as the second argument of the Kullback-Leibler divergence. The version using a Kullback-Leibler divergence makes it clearer that we measure the average disagreement (divergence) from the marginal/average distribution. The version using mutual information is more common in Bayesian literature.
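To make the two forms concrete, here is a minimal sketch in plain Python that computes the BALD score for a single input from the per-member class probabilities, once as a difference of entropies (the mutual information form) and once as an average KL divergence from the marginal. The function names and the example probabilities are made up for illustration; a real implementation would operate on batched tensors.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def marginal(member_probs):
    """Average the per-member predictives q(y | x, w) over the members."""
    k = len(member_probs)
    return [sum(col) / k for col in zip(*member_probs)]


def kl(p, q):
    """KL divergence D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def bald_mi(member_probs):
    """Mutual-information form: H[Y | x] - E_w H[Y | x, w]."""
    avg_cond = sum(entropy(p) for p in member_probs) / len(member_probs)
    return entropy(marginal(member_probs)) - avg_cond


def bald_kl(member_probs):
    """KL form: E_w D_KL(q(Y | x, w) || q(Y | x))."""
    m = marginal(member_probs)
    return sum(kl(p, m) for p in member_probs) / len(member_probs)


# Three hypothetical ensemble members, three classes, one input x.
agree = [[0.7, 0.2, 0.1]] * 3
disagree = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
print(bald_mi(agree), bald_mi(disagree), bald_kl(disagree))
```

When all members agree, the score is zero (up to floating-point error) no matter how uncertain the shared prediction is; the score only grows when members contradict each other.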

For a deep ensemble, the number of available samples is low, e.g., we usually only have 3-5 different ensemble members, which limits how much disagreement can be expressed. In particular, we have \[BALD \le \log \min \{K, C\},\] where \(K\) is the number of ensemble members and \(C\) is the number of classes [4]. But even in other cases, without regularization our models are unlikely to be as diverse as they could be where we want them to be: on the pool set points, in particular! :bulb:
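A short sketch of why this bound holds, assuming \(\qof{\w}\) is the empirical distribution over the \(K\) ensemble members (so its entropy is at most \(\log K\)) and writing \(\varHof{\opq}{\cdot}\) for entropies under \(\opq\): mutual information is bounded by the entropy of either of its arguments, so \[ \varMIof{\opq}{\W;\Y\given \x} \le \min \{ \varHof{\opq}{\W}, \varHof{\opq}{\Y \given \x} \} \le \min \{ \log K, \log C \} = \log \min \{K, C\}. \] The bound is attained, for instance, when \(K \le C\) and each member puts all of its probability mass on a different class: the marginal is then uniform over \(K\) classes while every member's predictive entropy is zero.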

Research Idea: Boosting Disagreement Exactly Where We Need It

The research idea is that we can boost the disagreement between ensemble members—and make the BALD scores more “informative” for active learning—by encouraging disagreement of the model on the pool set in active learning.

This encourages disagreement of the predictions exactly where we want it: on the pool set that we have to select samples from! Given the limited capacity of the models we use, they might otherwise not express high disagreement everywhere we would like.
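As a rough illustration of what such an objective could look like, the sketch below combines the ensemble's average negative log-likelihood on labeled data with a subtracted BALD term averaged over the unlabeled pool. The additive form and the weight `lam` are assumptions made here; the idea above does not fix the exact objective. Plain Python lists stand in for model outputs, whereas in actual training the probabilities would come from differentiable models so that gradients flow through both terms.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def bald(member_probs):
    """BALD score for one input: H[mean prediction] - mean H[prediction]."""
    k = len(member_probs)
    marginal = [sum(col) / k for col in zip(*member_probs)]
    return entropy(marginal) - sum(entropy(p) for p in member_probs) / k


def regularized_loss(train_probs, train_labels, pool_probs, lam):
    """Average NLL on labeled data minus lam * mean BALD on the pool.

    train_probs[k][n] and pool_probs[k][m] hold the class probabilities of
    member k on labeled point n and on pool point m, respectively.
    `lam` is a hypothetical regularization weight.
    """
    k, n, m = len(train_probs), len(train_labels), len(pool_probs[0])
    nll = -sum(
        math.log(train_probs[i][j][train_labels[j]])
        for i in range(k) for j in range(n)
    ) / (k * n)
    mean_bald = sum(
        bald([pool_probs[i][j] for i in range(k)]) for j in range(m)
    ) / m
    return nll - lam * mean_bald


# Two members, one labeled point, one pool point, two classes (made-up numbers).
train_probs = [[[0.8, 0.2]], [[0.7, 0.3]]]
train_labels = [0]
pool_same = [[[0.6, 0.4]], [[0.6, 0.4]]]
pool_diverse = [[[0.9, 0.1]], [[0.3, 0.7]]]
loss_same = regularized_loss(train_probs, train_labels, pool_same, lam=0.5)
loss_diverse = regularized_loss(train_probs, train_labels, pool_diverse, lam=0.5)
print(loss_same, loss_diverse)
```

With identical fit on the labeled data, the ensemble that disagrees on the pool point obtains the lower loss, which is the pressure the regularizer is meant to create.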

Importantly, when we perform active learning under distribution shift, for example, using transductive active learning as in EPIG [5], we have access to an unlabeled evaluation set, which contains points in our target domain. We will want to boost disagreement on this evaluation set (instead of on the pool set that might also contain data that is not relevant to our target domain).

This insight directly connects to the recent paper “Diversify and Disambiguate: Learning From Underspecified Data” by Lee et al. [6], which uses a target set in a non-active learning scenario (more or less, see below for a comparison) to boost disagreement between multiple classifier heads using a bespoke training objective. The objective has two extra regularizers: one to make predictions of different points from the “target domain” independent between different classifier heads, and one to avoid the collapse of classifier heads to a single class (which also leads to independent albeit less interesting predictions).

Beyond Active Learning

Beyond active learning, we can use this diversity booster when training deep ensembles in general: instead of a pool or evaluation set, we could boost disagreement on augmented training data, for example, when using mix-up augmentation [7].

To be frank (and not Andreas :nerd_face:), it is peculiar that we augment training data and pretend that it is in-distribution when most likely it looks like nothing we would see—again, I am referring to mix-up augmentation as an example here, but this also applies to other augmentations.

Boosting disagreement on augmentations is also in line with Smith et al., 2018 [8], which found that models (trained without augmentation) will report high epistemic uncertainty for interpolations of samples in image space. On the other hand, when we train with augmentation, epistemic uncertainty collapses on augmented samples, even though these samples should often be considered epistemically ambiguous because we do not have training data for them. Indeed, we would want to encourage the model to only “hypothesize” different explanations for these samples, which hopefully will aid generalization.

Lastly, we can also use this idea for outlier exposure when we use BALD scores for OOD detection. However, see “Research Idea: Intellectually Pleasing Outlier Exposure (with Applications in Active Learning)” for a different idea on improving outlier exposure. In particular, using disagreement boosting for outlier exposure with active learning is most likely a footgun, as we might then prefer outliers to other interesting unlabeled samples for active learning.

Related Work

Polina and I discussed this research idea initially in November 2021. Sadly, I have not yet gotten around to running experiments as initially intended.

In February 2022, “Diversify and Disambiguate: Learning From Underspecified Data” by Lee et al. [9] was published, which follows a similar motivation in a different use case.

However, while that paper alludes to active learning and mentions it as related work, they do not run experiments using active learning.

They train different classifier heads on top of the same encoder and minimize the mutual information \(\MIof{\Y_1; \Y_2 \given \X_1, \X_2}\) for \(\X_1, \X_2\) sampled from their target domain.

Is BALD Not Sufficient As Regularizer?

Using BALD as a regularizer could lead to each member simply predicting a single class independently of the input in the pool set. When there are more ensemble members than classes, maximum disagreement for individual points can be easily reached even though several ensemble members make identical predictions: imagine two classes and four ensemble members, two always predicting class 0 and the other two always predicting class 1.
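The degenerate case from this example is easy to reproduce. The sketch below builds four hypothetical "members" that ignore their input entirely and shows that the BALD score is still maximal, \(\log 2\), at every pool point (the pool values are arbitrary placeholders).

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def bald(member_probs):
    """BALD score for one input: H[mean prediction] - mean H[prediction]."""
    k = len(member_probs)
    marginal = [sum(col) / k for col in zip(*member_probs)]
    return entropy(marginal) - sum(entropy(p) for p in member_probs) / k


def constant_member(cls, num_classes=2):
    """A hypothetical degenerate member: ignores x, always predicts `cls`."""
    def predict(x):
        return [1.0 if c == cls else 0.0 for c in range(num_classes)]
    return predict


# Two classes, four members: two always say class 0, two always say class 1.
members = [constant_member(0), constant_member(0),
           constant_member(1), constant_member(1)]
pool = [-3.0, 0.0, 0.5, 42.0]
scores = [bald([f(x) for f in members]) for x in pool]
print(scores, math.log(2))
```

Every pool point receives the same maximal score, so the scores carry no information for ranking points, even though the regularizer would be fully satisfied.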

Comparison to “Diversify and Disambiguate”

Minimizing the total correlation \(\TCof{\Y_1; \ldots; \Y_n \given \x_1; \ldots ; \x_n}\) between the predictions for different points might be more beneficial because it can encourage diversity beyond a single sample. Recall that in “Research Idea: Approximating BatchBALD via ‘k-BALD’”, we have shown that the total correlation connects the BatchBALD joint mutual information to individual mutual information terms: \[ \MIof{\W; \Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n} = \sum_i \MIof{\W; \Y_i \given \x_i} - \TCof{\Y_1; \ldots; \Y_n \given \x_1; \ldots ; \x_n}. \] By minimizing the total correlation, we can potentially boost the BatchBALD scores overall, assuming the individual BALD scores \(\MIof{\W; \Y_i \given \x_i}\) remain constant, which could allow for better separation between different samples for active learning, etc.
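The decomposition can be checked numerically on a tiny example. The sketch below assumes a small ensemble with made-up class probabilities, builds the joint predictive by averaging the per-member products (labels are independent given the parameters), and computes all three quantities from entropies. The function names are illustrative, and the exhaustive enumeration of label combinations only scales to very small \(n\).

```python
import math
from itertools import product


def entropy(probs):
    """Shannon entropy (in nats) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def joint(member_probs, idx):
    """Joint predictive over the points in `idx`, averaged over members."""
    k = len(member_probs)
    c = len(member_probs[0][0])
    return {
        ys: sum(math.prod(m[i][y] for i, y in zip(idx, ys))
                for m in member_probs) / k
        for ys in product(range(c), repeat=len(idx))
    }


def decomposition(member_probs):
    """Return (BatchBALD, sum of BALD scores, total correlation) in nats."""
    k = len(member_probs)
    n = len(member_probs[0])
    # E_w H[Y_i | x_i, w] for each point i.
    cond = [sum(entropy(m[i]) for m in member_probs) / k for i in range(n)]
    # H[Y_i | x_i] for each point i, and the joint entropy over all points.
    marg = [entropy(joint(member_probs, (i,)).values()) for i in range(n)]
    h_joint = entropy(joint(member_probs, tuple(range(n))).values())
    batch_bald = h_joint - sum(cond)
    bald_sum = sum(h - e for h, e in zip(marg, cond))
    tc = sum(marg) - h_joint
    return batch_bald, bald_sum, tc


# members[k][i] = class probabilities of member k at point i (made-up values).
members = [
    [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7], [0.5, 0.5]],
    [[0.5, 0.5], [0.1, 0.9], [0.7, 0.3]],
]
batch_bald, bald_sum, tc = decomposition(members)
print(batch_bald, bald_sum - tc)
```

Because the total correlation is non-negative, the joint score can never exceed the sum of the individual scores; shrinking the total correlation closes that gap.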

Applying the knowledge from “Research Idea: Approximating BatchBALD via ‘k-BALD’”, we also see from the introduced k-BALD approximation that the sum of pairwise mutual information terms \(\sum_{i<j} \MIof{\Y_i; \Y_j \given \x_i, \x_j}\) is a “first-term” approximation of the total correlation using the inclusion-exclusion principle, as we have: \[ \TCof{\Y_1; \ldots; \Y_n \given \x_1; \ldots ; \x_n} = \sum_{i<j} \MIof{\Y_i; \Y_j \given \x_i, \x_j} - \sum_{i<j<k} \MIof{\Y_i; \Y_j; \Y_k \given \x_i, \x_j, \x_k} + \ldots. \] Now we can circle back to “Diversify and Disambiguate: Learning From Underspecified Data” by Lee et al. [10]: they minimize the mutual information between pairs of predictions. In their notation: \[\mathcal{L}_{\mathrm{MI}}\left(f_i, f_j\right)=D_{\mathrm{KL}}\left(\opp\left(\widehat{y_i}, \widehat{y_j}\right) \| \opp\left(\widehat{y_i}\right) \otimes \opp\left(\widehat{y_j}\right)\right)\] This is indeed \(\MIof{\Y_i; \Y_j \given \x_i, \x_j}\), as can be verified quickly.
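For intuition, the pairwise terms can be computed directly as a KL divergence between the joint predictive of two points and the product of their marginals. The sketch below uses a small made-up ensemble and compares the pairwise sum with the exact total correlation; the helper names are illustrative and the enumeration is only feasible for a handful of points.

```python
import math
from itertools import combinations, product


def entropy(probs):
    """Shannon entropy (in nats) of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def joint(member_probs, idx):
    """Joint predictive over the points in `idx`, averaged over members."""
    k = len(member_probs)
    c = len(member_probs[0][0])
    return {
        ys: sum(math.prod(m[i][y] for i, y in zip(idx, ys))
                for m in member_probs) / k
        for ys in product(range(c), repeat=len(idx))
    }


def pairwise_mi(member_probs, i, j):
    """I[Y_i; Y_j | x_i, x_j] as D_KL(joint || product of marginals)."""
    q_ij = joint(member_probs, (i, j))
    q_i = joint(member_probs, (i,))
    q_j = joint(member_probs, (j,))
    return sum(
        p * math.log(p / (q_i[(a,)] * q_j[(b,)]))
        for (a, b), p in q_ij.items() if p > 0.0
    )


def total_correlation(member_probs):
    """Sum of marginal entropies minus the joint entropy over all points."""
    n = len(member_probs[0])
    marg = sum(entropy(joint(member_probs, (i,)).values()) for i in range(n))
    return marg - entropy(joint(member_probs, tuple(range(n))).values())


# members[k][i] = class probabilities of member k at point i (made-up values).
members = [
    [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]],
    [[0.2, 0.8], [0.3, 0.7], [0.5, 0.5]],
    [[0.5, 0.5], [0.1, 0.9], [0.7, 0.3]],
]
pair_sum = sum(pairwise_mi(members, i, j) for i, j in combinations(range(3), 2))
print(pair_sum, total_correlation(members))
```

For two points, the pairwise term and the total correlation coincide; for three or more points, the remaining gap is what the higher-order interaction terms account for.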

However, we made an important assumption: the individual BALD scores \(\MIof{\W; \Y_i \given \x_i}\) have to remain constant. What if we enforce that by maximizing the BatchBALD scores directly, or rather their approximation using 2-BALD?

Maximizing BatchBALD/2-BALD over the whole pool set (or evaluation set) would provide a novel objective.

Regularization == Constrained Optimization

If we want to be Bayesian, we want to minimize a variational inference (VI) objective while encouraging disagreement. To remain “principled,” we can rewrite the regularized objective using a Lagrange multiplier \(\lambda\) and switch back to a constrained objective: we can perform VI while enforcing a lower-bound constraint on the BatchBALD scores.
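One way to write this down, as a sketch: the symbol \(\mathcal{L}_{\mathrm{VI}}\) for the VI objective and the threshold \(c\) are notation assumed here for illustration. The regularized objective is \[ \min_{\opq} \; \mathcal{L}_{\mathrm{VI}}(\opq) - \lambda \, \varMIof{\opq}{\W; \Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n}, \quad \lambda \ge 0, \] while the constrained version is \[ \min_{\opq} \; \mathcal{L}_{\mathrm{VI}}(\opq) \quad \text{s.t.} \quad \varMIof{\opq}{\W; \Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n} \ge c. \] The Lagrangian of the constrained problem, \(\mathcal{L}_{\mathrm{VI}}(\opq) - \lambda \, (\varMIof{\opq}{\W; \Y_1, \ldots, \Y_n \given \x_1, \ldots, \x_n} - c)\), differs from the regularized objective only by the constant \(\lambda c\), so for a fixed \(\lambda\) both have the same minimizers. Which threshold \(c\) a given \(\lambda\) corresponds to depends on the problem.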

Open Questions

  1. Can we maximize BALD scores as a regularizer, or can we maximize 2-BALD (as a BatchBALD approximation), and does this differ from minimizing only the total correlation or an approximation thereof, following Lee et al. [11]?

  2. Would a last-layer approach work as well? The advantage is that the “last-layer ensemble” approach is cheaper than approximate BNNs or deep ensembles.

  3. Otherwise, from a principled BNN point of view, how do we avoid changing the approximate posterior too much? Ideally, when we increase diversity, we do not want to change the marginal predictives for other data, and we want to avoid all pool samples having consistently very high scores—of course, that would not help with picking the most informative samples either.