This post discusses potential failure cases of outlier exposure—when using “fake” label distributions for outliers—and presents an intellectually pleasing version of outlier exposure in latent space, treating outliers as purely negative samples from a contrastive point-of-view.
About this post: I repeat my motivation from the last post: during my day-to-day, I read papers and procrastinate from writing my thesis, so I often come up with research ideas that I don’t have the experience, time, or computing resources for at the moment. The following is such a research idea which I consider neat—if it has not been published already. I’d be pleased about any pointers to relevant literature.
A short recap of ‘outlier exposure’
We consider the auxiliary task of separating in-distribution (iD) data, which we want to predict on, from out-of-distribution (OoD) data, which we want to detect and avoid.
‘Outlier exposure’ makes use of OoD data in some way to improve the performance of the OoD detection task. The first recent instance of this idea is by Hendrycks et al. (2018) 1.
Outlier exposure is performed by adding a secondary loss to the regular loss for iD training data (usually a cross-entropy loss + other regularization). For classification models, usually this secondary loss is another cross-entropy loss using uniform predictions for OoD data (1/C for all classes, where C is the number of classes).
In the following, I will use OoD and avoid the term outlier.
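The secondary loss described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the reference implementation of Hendrycks et al.; the function names and the mixing weight `weight=0.5` are assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax for one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy(logits, label):
    """Regular iD loss: negative log-probability of the true class."""
    return -log_softmax(logits)[label]

def uniform_oe_loss(logits):
    """OoD loss: cross-entropy between the uniform target (1/C per class) and the prediction."""
    log_probs = log_softmax(logits)
    return -sum(log_probs) / len(log_probs)

def total_loss(id_batch, ood_batch, weight=0.5):
    """Mean iD cross-entropy plus `weight` (hypothetical value) times the mean OoD loss."""
    id_term = sum(cross_entropy(l, y) for l, y in id_batch) / len(id_batch)
    ood_term = sum(uniform_oe_loss(l) for l in ood_batch) / len(ood_batch)
    return id_term + weight * ood_term

# Toy batch: (logits, label) pairs for iD data, bare logits for OoD data.
id_batch = [([2.0, 0.1, -1.0], 0), ([0.0, 3.0, 0.5], 1)]
ood_batch = [[0.2, 0.1, 0.0]]
print(total_loss(id_batch, ood_batch))
```

The OoD term is smallest (equal to log C) exactly when the prediction is uniform, which is what pulls OoD predictions towards 1/C for every class.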
Issues
There are two issues with assigning a label distribution to OoD data: it can distort the latent space (and we usually prefer a nice, happy, well-structured latent space), and it only works by accident because we typically use balanced datasets for benchmarking. (I guess these insights constitute part of the contributions because I have not seen them detailed before.)
Why uniform predictions for OoD, or: Nobody expects the class-imbalanced dataset!
Some works like, e.g., Liu et al. (2020)2 argue that assigning uniform predictions is the optimum from a risk minimization perspective.
However, what about when our iD datasets that we want to predict on have (strong) class imbalances?
Indeed, assigning uniform predictions for OoD data when the iD data has class imbalances is a terrible idea as it would change the learned class prior \(p(y)\). After all, both losses propagate through a single classifier that is first and foremost meant for iD predictions.
On the other hand, assigning the class prior to OoD samples when there is strong class imbalance does not work either. To see why, ask why outlier exposure with uniform predictions works well in the first place. Assuming (and this is an often silent assumption, see Kirsch et al, 2021 3) that our iD samples have no ambiguities, i.e., that the target distribution for each iD input is one-hot, outlier exposure teaches the model to assign latents to OoD samples that are far away from all classes and end up in the center between the classes in latent space. The latent space of softmax cross-entropy classifiers is examined by Hess et al. (2020)4 and Pearce et al. (2021) 5. Now, if the class prior is imbalanced, we would encourage latents for OoD samples to lie closer to the majority classes, making it harder to detect OoD data.
For example, confidence and predictive entropy would fail badly if we had a very strong class imbalance. But so would density-based OoD approaches like DDU (Mukhoti et al, 2021 6) that use the latent space directly, because latents from iD and OoD samples would lie closer together, making separation difficult.
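A small numeric illustration of why confidence and entropy lose their signal under class imbalance. The numbers are made up for the example (10 classes, a 91% majority class); they are not taken from any benchmark.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0)

num_classes = 10
# Possible target distributions for OoD samples.
uniform_target = [1.0 / num_classes] * num_classes
imbalanced_prior = [0.91] + [0.01] * (num_classes - 1)  # hypothetical class prior
# A typical confident prediction on an iD sample of the majority class.
confident_id = [0.99] + [0.01 / (num_classes - 1)] * (num_classes - 1)

for name, p in [("uniform target", uniform_target),
                ("imbalanced prior target", imbalanced_prior),
                ("confident iD prediction", confident_id)]:
    print(f"{name}: entropy={entropy(p):.3f}, confidence={max(p):.2f}")
```

With the uniform target, OoD predictions sit at the maximum entropy log C and are far from the confident iD prediction. With the imbalanced prior as target, both the entropy and the maximum softmax probability of OoD predictions move much closer to those of a majority-class iD prediction.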
Poor Latent Space
The other issue is one for other downstream tasks beyond OoD detection: again, if we use outlier exposure with uniform predictions, we essentially teach the model to put all OoD samples in the same area of our latent space (as we have seen above). This is bad because OoD samples are usually not similar, rather the opposite, and yet here we are trying to get them to share the same area in latent space!
There is growing research (and a strong belief) that a well-regularized latent space is beneficial. Some works try to achieve desirable properties by encouraging bi-Lipschitzness, others by using self-supervised approaches and pretraining.
Yet outlier exposure using uniform predictions encourages OoD samples to cluster together, potentially distorting the latent space a lot, almost as if we had made up a new class “OoD” and assigned a separate label to OoD samples. Note that this analogy fails because a separate label class would not even end up in the center between other labels but alongside them, so it would probably be even worse. See Hess et al. (2020) 7, and Pearce et al. (2021) 8 for intuitions.
Overall, this is not good.
Alternative: outlier exposure with a “soft” contrastive touch
As an intellectually more pleasing alternative, we can take a contrastive loss perspective: we can use whatever OoD data we have available as negative samples and “push away” OoD samples from iD samples. Unlike assigning uniform predictions, this does not overly constrict or affect the latent space.
The main insight is that OoD samples do not have a valid label distribution; therefore, we should not invent one for them. By performing outlier exposure in latent space, we precisely avoid having to come up with a label distribution for OoD samples. I like this.
Training would consist of a combination of cross-entropy (or another contrastive loss objective) as the regular loss, with outlier exposure samples used as negative examples in a secondary contrastive loss term. For example, we could use a Parzen estimator to avoid overlap between iD and OoD samples, such that our combined loss \(\mathcal{L}\) would be:
\[ \mathcal{L} = \mathbb{E}_{\hat{p}(x_{id}, y_{id})} [-\log p_\theta(y_{id} \mid x_{id})] + R + \lambda \mathbb{E}_{\hat{p}(x_{id}, x_{ood})} [-\Vert Z_{id} - Z_{ood} \Vert^2 / \rho], \]
where the first term is a regular cross-entropy loss, \(R\) is any other regularizer term in our loss (weight decay, etc.), \(\rho\) is the length scale, \(\lambda\) is a Lagrange multiplier for the (contrastive) loss for negative OoD samples, and \(Z_{id/ood} = f_\theta(X_{id/ood})\) are the feature embeddings for \(X_{id/ood}\), respectively. The third term is the log density of a Gaussian with length scale \(\rho\), up to constants (hence Parzen estimator), and acts as a repulsive term because it becomes minimal when we maximize the distance between possible \(z_{id}\) and \(z_{ood}\).
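A minimal sketch of the repulsive term as described in the text, written so that it decreases as iD and OoD embeddings move apart. The function names and the default values of `lam` and `rho` are assumptions for illustration, and embeddings are plain lists of floats to keep the example dependency-free.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def repulsion_term(z_id, z_ood, rho=1.0):
    """Mean of -||z_id - z_ood||^2 / rho over all iD/OoD pairs (lower = farther apart)."""
    total = sum(squared_distance(a, b) for a in z_id for b in z_ood)
    return -total / (rho * len(z_id) * len(z_ood))

def combined_loss(ce_loss, regularizer, z_id, z_ood, lam=0.1, rho=1.0):
    """Cross-entropy + other regularizers + lam * repulsion between iD and OoD latents."""
    return ce_loss + regularizer + lam * repulsion_term(z_id, z_ood, rho)

z_id = [[0.0, 0.0], [0.2, -0.1]]
near_ood = [[0.5, 0.5]]
far_ood = [[4.0, 4.0]]
print(combined_loss(0.3, 0.01, z_id, near_ood))
print(combined_loss(0.3, 0.01, z_id, far_ood))
```

Note that the term as written is unbounded below, so a real training run would probably need a small \(\lambda\) or a cap on the distance; that caveat is an assumption of this sketch, not something the text specifies.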
Hendrycks et al. (2018)9 propose using a margin ranking loss on the log probabilities between iD and OoD samples when the overall objective is density estimation. Here, we make a similar but more general case that doing outlier exposure in latent space (and not predictive distribution space) is the intellectually pleasing thing to do.
Connection to Information Bottleneck
If we wanted a theoretically more principled formulation (in an information-theoretic sense), we can connect the objective to information bottlenecks (Tishby et al., 2015 10; and see Kirsch et al., 2020 11 for another neat overview—especially the first version of the paper). This is because information bottleneck objectives naturally include the latent embeddings in their objective.
The standard information bottleneck objective is (in a rewritten form): \[ \min H[Y_{id}|Z_{id}] + \beta H[Z_{id}|Y_{id}], \] which then is approximated using an actual model with parameters \(\theta\) as \[ \min_\theta H_\theta[Y_{id}|Z_{id}] + \beta H_\theta[Z_{id}|Y_{id}],\] where the first term is essentially the regular cross-entropy loss (to avoid overlap for different classes) and the second term is a regularization term that concentrates the latents for each class as much as possible.
The goal of having no overlap between iD and OoD samples in latent space can be expressed as a mutual information maximization of \(I[Z;T]\), where \(T\) is a Bernoulli variable that is, e.g., \(0\) for OoD data and \(1\) for iD data, and \(Z\) are the embeddings. Note that this mutual information runs as an expectation over both iD and OoD samples. Then, \(I[Z;T] = H[T] - H[T|Z]\) is maximal when iD and OoD samples perfectly separate.
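To make the claim about \(I[Z;T] = H[T] - H[T|Z]\) concrete, here is a toy calculation in which the latent space is discretized into two regions. The joint tables are invented for the example; `joint[z][t]` holds the probability of region `z` together with indicator value `t`.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mutual_information(joint):
    """I[Z;T] = H[T] - H[T|Z] for a table joint[z][t] with a binary T."""
    p_t = [sum(row[t] for row in joint) for t in range(2)]
    h_t_given_z = 0.0
    for row in joint:
        p_z = sum(row)
        if p_z > 0:
            # Weight the entropy of p(T | Z = z) by p(Z = z).
            h_t_given_z += p_z * entropy([v / p_z for v in row])
    return entropy(p_t) - h_t_given_z

# Each latent region contains only one kind of sample.
separated = [[0.5, 0.0], [0.0, 0.5]]
# Both kinds of samples are spread evenly over both regions.
overlapping = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(separated))
print(mutual_information(overlapping))
```

With perfect separation, \(H[T|Z]\) vanishes and the mutual information reaches its upper bound \(H[T]\); with full overlap, knowing \(Z\) says nothing about \(T\) and the mutual information is zero.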
Tying this back to the original loss above, we can approximate \(I[Z;T]\) using a model like, e.g., a Parzen estimator and hence use an approximation \(I_\psi[Z;T]\). This gives us an objective that fits well into information bottleneck theory: \[ \min H[Y_{id}|Z_{id}] + \beta H[Z_{id}|Y_{id}] - \lambda I[Z ; T].\] When we use an actual model with parameters \(\psi\) to estimate \(I_\psi[Z;T] = H[T] - H_\psi[T|Z]\), we note that \(H[T]\) is just the entropy of a Bernoulli variable for the iD/OoD probability. It is constant for a given training set of iD and OoD samples and hence can be dropped from the optimization objective.
We end up with the following objective:
\[ \min_\theta H_\theta[Y_{id}|Z_{id}] + \beta H_\theta[Z_{id}|Y_{id}] + \lambda H_\psi[T|Z].\]
This means we can view this form of outlier exposure as training a separate binary classifier and minimizing its cross-entropy. To avoid negatively affecting the iD classification (or other downstream tasks), we have to make sure that this binary classifier is more “expressive” than the softmax layer of the iD classifier we are probably using—hence a non-parametric Parzen estimator seems ideal, and we end up with a simple repellent term like the one above.
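One way to read the binary-classifier view is sketched below: a kernel (Parzen-style) estimate of the probability that an embedding is iD, and the resulting cross-entropy as a Monte Carlo estimate of \(H_\psi[T|Z]\). The Gaussian kernel, the small `eps` for numerical safety, and all function names are assumptions of this sketch rather than a construction given in the text.

```python
import math

def gaussian_kernel(a, b, rho):
    """Unnormalized Gaussian kernel with length scale rho."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * rho ** 2))

def parzen_posterior(z, ref_id, ref_ood, rho=1.0, eps=1e-12):
    """Kernel estimate of (p(iD | z), p(OoD | z)) from reference embeddings."""
    k_id = sum(gaussian_kernel(z, r, rho) for r in ref_id) + eps
    k_ood = sum(gaussian_kernel(z, r, rho) for r in ref_ood) + eps
    total = k_id + k_ood
    return k_id / total, k_ood / total

def binary_cross_entropy(queries_id, queries_ood, ref_id, ref_ood, rho=1.0):
    """Average negative log-probability of the correct iD/OoD indicator."""
    losses = [-math.log(parzen_posterior(z, ref_id, ref_ood, rho)[0]) for z in queries_id]
    losses += [-math.log(parzen_posterior(z, ref_id, ref_ood, rho)[1]) for z in queries_ood]
    return sum(losses) / len(losses)

ref_id = [[0.0, 0.0], [0.1, 0.0]]
ref_ood = [[5.0, 5.0], [5.1, 5.0]]
# Well-separated latents give a cross-entropy close to zero.
print(binary_cross_entropy([[0.05, 0.0]], [[5.05, 5.0]], ref_id, ref_ood))
# If OoD latents coincide with iD latents, the classifier can only guess.
print(binary_cross_entropy([[0.05, 0.0]], [[0.05, 0.0]], ref_id, ref_id))
```

Because the estimator has no parameters of its own beyond the length scale, minimizing this cross-entropy can only be achieved by moving the embeddings, which matches the repellent effect described above.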
While this section introduced an IB-based objective, we have also examined a different perspective on OoD detection using outlier exposure: by simply training a binary classifier to separate iD and available OoD data. Importantly, this approach does not require a separate label distribution.
Applications/Experiments
For OoD detection, this form of outlier exposure ought to work well with feature-space density methods like DDU 12 or Mahalanobis distances 13, but not so much with softmax entropy or confidence; then again, those are not a good idea anyway (Kirsch et al., 2021 14).
Disambiguating epistemic uncertainty estimation and OoD detection
Moreover, we have a hybrid for estimating epistemic uncertainty separately from the OoD detection problem.
This can be attractive for active learning: while for OoD detection, we fit a density estimator on the iD latents; to estimate epistemic uncertainty, we fit a density estimator on all available data, including the OoD data used during training. For the latter, we can use the negative log density as a proxy for epistemic uncertainty for example.
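The two density estimators can be illustrated with a toy example. A Gaussian kernel density estimate stands in for whatever density model is actually used (this choice, the length scale, and the function names are assumptions of the sketch).

```python
import math

def neg_log_density(z, reference, rho=1.0):
    """Negative log of a Gaussian kernel density estimate at z."""
    dim = len(z)
    log_norm = -0.5 * dim * math.log(2.0 * math.pi * rho ** 2) - math.log(len(reference))
    exponents = [-sum((x - y) ** 2 for x, y in zip(z, r)) / (2.0 * rho ** 2)
                 for r in reference]
    m = max(exponents)  # log-sum-exp trick for numerical stability
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return -(log_norm + log_sum)

def ood_score(z, id_latents, rho=1.0):
    """OoD detection: density fitted on iD latents only."""
    return neg_log_density(z, id_latents, rho)

def epistemic_score(z, id_latents, ood_latents, rho=1.0):
    """Epistemic uncertainty proxy: density fitted on all training latents."""
    return neg_log_density(z, id_latents + ood_latents, rho)

id_latents = [[0.0, 0.0], [0.2, 0.0]]
ood_latents = [[4.0, 4.0]]   # OoD data seen during training
seen_ood = [4.0, 4.0]        # looks like the training OoD data
novel = [-4.0, 4.0]          # unlike anything seen so far

for name, z in [("seen OoD", seen_ood), ("novel", novel)]:
    print(name, ood_score(z, id_latents), epistemic_score(z, id_latents, ood_latents))
```

In this toy setup, the point that resembles the training OoD data gets a high OoD score but a low epistemic score, while the novel point scores high on both, which is the distinction drawn above.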
It is a neat conceptual insight (I think!) that OoD detection and epistemic uncertainty are indeed not the same thing when using outlier exposure—however, when the training data does not contain any OoD data, they do match 15.
Thus, if we use outlier exposure, we can use model disagreement (query-by-committee) for deep ensembles or mutual information estimates (for BNNs) for OoD detection but only as a proxy for epistemic uncertainty!
Active Learning
The difference between OoD detection and epistemic uncertainty becomes clearer in active learning when the pool set contains OoD data we simply do not want to sample.
Here it is clear that active learning benefits from acquiring samples with high epistemic uncertainty, that is, samples that the model has not seen before, and not samples that it considers OoD. Indeed, in active learning, the acquisition function is often considered a proxy for epistemic uncertainty.
For example, imagine both CIFAR-100 and CIFAR-10 are in the pool set. We do not want to learn about CIFAR-10 but only CIFAR-100. Now, in a setting where we have not labeled any data yet, active learning will eventually acquire CIFAR-10 samples. These samples will be categorized as OoD and added as such to the training set using our outlier exposure method. After training with outlier exposure, the model will gradually learn to avoid these samples.
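A single acquisition round of the loop described above might look roughly like this. Everything here is hypothetical scaffolding: `score_fn` stands for whatever epistemic uncertainty proxy is used, and `oracle` is assumed to return a class label for iD samples and `None` for samples the annotator marks as OoD.

```python
def acquisition_step(pool, score_fn, oracle, labeled, ood_set):
    """Acquire the highest-scoring pool item and file it as labeled iD or as OoD."""
    index = max(range(len(pool)), key=lambda i: score_fn(pool[i]))
    item = pool.pop(index)
    label = oracle(item)
    if label is None:
        # Annotator flags the sample as OoD: keep it as a negative example
        # for outlier exposure instead of discarding it.
        ood_set.append(item)
    else:
        labeled.append((item, label))
    return item, label

# Toy usage with scalar "inputs": values above 2 play the role of CIFAR-10.
toy_pool = [[0.0], [3.0], [1.0]]
toy_labeled, toy_ood = [], []
acquisition_step(toy_pool, lambda z: z[0], lambda z: None if z[0] > 2 else 0,
                 toy_labeled, toy_ood)
print(toy_labeled, toy_ood, toy_pool)
```

In a full loop, the model (and hence `score_fn`) would be retrained with outlier exposure after each round, so that samples resembling the collected OoD set stop receiving high acquisition scores.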
This approach could work well for deterministic models.
The biggest question is what density estimator to use: a simple GMM (GDA/LDA) as used by 16 or 17 does not work for OoD samples because they will not cluster, which is unfortunately the goal of using a contrastive approach with negative samples for outlier exposure after all.
A combination of a simple GMM for iD data and a Parzen estimator might work. This is an open question though.
Fine-grained research questions
As a simple baseline: could we just train a regular iD softmax classifier (single linear layer + softmax on top of the latents) and a separate sigmoid binary OoD classifier (single linear layer + sigmoid)? The latents might already cluster well for the binary classifier. An overall density can be computed using the training ratio of iD and OoD samples, which can then be used to estimate epistemic uncertainty.
What are good ways to measure issues arising from distortions of the latent space? Downstream tasks? Adversarial robustness?
What other models than a Parzen estimator or sigmoid linear layer can we use for the binary iD/OoD classifier? And how do they influence the underlying latent space during training?
While a Parzen estimator leads to a nice and simple secondary loss term, does it make for a good estimator of epistemic uncertainty (in combination with a GMM) for active learning?
How well does this “soft” contrastive outlier exposure method perform on regular OoD benchmarks (as another potential application)? Do we see benefits for robustness because this approach does not distort the latent space as much as outlier exposure with uniform predictions?
Please cite this blog post if you find this write-up useful and it inspires your own research, and please drop me a comment.
Thanks!