In information theory, the data processing inequality (DPI) is a powerful concept. Informally, it tells us that processing data cannot increase the amount of information it contains. In this two-part blog post, we will explore the DPI and its applications to function-space variational inference (FSVI).
In the first part of this blog post, we will provide intuitive explanations and present mathematical proofs of the DPI. Then, in the second part, we will explore the application of the data processing inequality to function-space variational inference and its relationship to variational inference in general.
The goal of this post is to look at the data processing inequality from different angles to better understand it. We will also consider the equality case (which is arguably the best way to understand inequalities).
Background: Information-Theoretic Notation
This blog post uses the same background as the previous blog post “Bayesian Appropriation: General Likelihood for Loss Functions”, so you can skip this section if you have already read that post.
Information theory deals with the communication of information. In this blog post, we use information-theoretic notation to express various quantities related to probability distributions and their relationships. Here are some key concepts we will use:
The information content of an event $x$ with probability $p(x)$ is $-\log p(x)$: the less likely an event, the more information it carries.

The entropy
$$H[X] := \mathbb{E}_{p(x)}[-\log p(x)]$$
is the expected information content. The entropy measures the average amount of information needed to describe the random variable $X$.

We will also use the Kullback-Leibler divergence
(🥬)
$$D_{KL}(p \,\|\, q) := \mathbb{E}_{p(x)}\left[\log \frac{p(x)}{q(x)}\right]$$
and the cross-entropy
$$H(p, q) := \mathbb{E}_{p(x)}[-\log q(x)] = H(p) + D_{KL}(p \,\|\, q).$$
The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution $p$ when using a code that is optimal for $q$.
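To make these definitions concrete, here is a minimal sketch in Python/NumPy (the distributions are made up for illustration) that computes the information content, entropy, cross-entropy, and 🥬 divergence for small discrete distributions. It uses natural logarithms, so the results are in nats rather than bits.

```python
import numpy as np

def entropy(p):
    """Entropy H[X] = E_p[-log p(x)], in nats (assumes p > 0 everywhere)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = E_p[-log q(x)], in nats."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """🥬 divergence D_KL(p || q) = cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.25, 0.25])   # "true" distribution (made up)
q = np.array([0.4, 0.4, 0.2])     # model distribution (made up)

print(f"information content of an event with p=0.25: {-np.log(0.25):.3f} nats")
print(f"H(p)       = {entropy(p):.3f} nats")
print(f"H(p, q)    = {cross_entropy(p, q):.3f} nats")
print(f"D_KL(p||q) = {kl_divergence(p, q):.3f} nats")
```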
Data Processing Inequality
The data processing inequality (DPI) is a fundamental inequality in information theory that states that the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables $X \to Y \to Z$ and says:
$$I[X; Y] \ge I[X; Z].$$
We can view $Z$ as the result of (potentially stochastically) processing $Y$: no matter how we process $Y$, we cannot gain additional information about $X$.
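As a quick numerical sanity check, here is a small sketch (with made-up distributions and channels) that builds a discrete Markov chain $X \to Y \to Z$ and verifies $I[X; Y] \ge I[X; Z]$:

```python
import numpy as np

def mutual_information(joint):
    """I[A; B] from a joint probability table p(a, b), in nats."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

rng = np.random.default_rng(0)

p_x = rng.dirichlet(np.ones(4))               # p(x), made up
T_yx = rng.dirichlet(np.ones(3), size=4)      # channel p(y|x), rows sum to 1
T_zy = rng.dirichlet(np.ones(2), size=3)      # channel p(z|y), rows sum to 1

joint_xy = p_x[:, None] * T_yx                # p(x, y)
joint_xz = p_x[:, None] * (T_yx @ T_zy)       # p(x, z) via the Markov chain X -> Y -> Z

print(f"I[X;Y] = {mutual_information(joint_xy):.4f} nats")
print(f"I[X;Z] = {mutual_information(joint_xz):.4f} nats  (never larger)")
```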
Example 1
The following three scenarios illustrate the data processing inequality using different mappings.
Image Processing Pipeline
Consider an image processing pipeline with the following steps. Let $X$ be the original image data, $Y$ be a compressed version of the image, and $Z$ be $Y$ after adding blur and pixelation.
In this case, $X \to Y \to Z$ forms a Markov chain, and the DPI tells us that $I[X; Y] \ge I[X; Z]$: blurring and pixelating the compressed image can only destroy information about the original image, never recover or add to it.
Supervised Learning
Consider a supervised learning pipeline with the following steps. Let $X$ be the input features, $Y$ be the intermediate representations learned by the model, and $Z$ be the model predictions.
Here, $I[X; Y] \ge I[X; Z]$.
This makes intuitive sense: the intermediate representations $Y$ are computed from the inputs $X$, and the predictions $Z$ are computed only from $Y$, so the predictions cannot carry more information about the inputs than the representations they are derived from.
As a more concrete example, consider an image classification model. Let $X$ be the input images, $Y$ be the activations of the convolutional layers, and $Z$ be the predicted image labels.
The convolutional layers will extract features from the input images, but they cannot extract more information than is present in the original images. The predicted labels are obtained by further processing these convolutional features, and so may lose some fine-grained information about the original inputs.
Autoencoders
An autoencoder compresses the input into a lower-dimensional latent code and then tries to reconstruct the input from it. Let $X$ be the input, $Y$ be the latent code, and $Z$ be the reconstruction.
The data processing inequality tells us again: $I[X; Y] \ge I[X; Z]$.
The latent code $Y$ cannot contain more information about the input than the input itself, and the reconstruction $Z$ cannot recover information that was already lost in the latent code.
Intuitively, autoencoders try to preserve as much mutual information between the inputs $X$ and the reconstructions $Z$ as possible within the bottleneck imposed by the latent code $Y$.
Proof of the Data Processing Inequality
The proof is simple and connects the DPI to another important inequality.
First, we note that the Markov chain implies the following factorization of the joint distribution:
$$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y).$$
This relies on $Z$ being conditionally independent of $X$ given $Y$: $p(z \mid x, y) = p(z \mid y)$.
We have the following chain of inequalities:
$$I[X; Z] = H[X] - H[X \mid Z] \overset{(2)}{\le} H[X] - H[X \mid Y, Z] = H[X] - H[X \mid Y] = I[X; Y],$$
where the last-but-one equality uses the Markov property $p(x \mid y, z) = p(x \mid y)$, and the inequality (2) is the statement that conditioning reduces entropy, which we prove below.
The equality gap is $H[X \mid Z] - H[X \mid Y, Z] = I[X; Y \mid Z]$: we have equality if and only if $X$ and $Y$ are conditionally independent given $Z$, that is, if $X \to Z \to Y$ is also a Markov chain.
Proof of (2) “Conditioning Reduces Entropy”:
We can easily show that conditioning reduces entropy by using the non-negativity of the mutual information:
$$H[X] - H[X \mid Y] = I[X; Y] \ge 0 \implies H[X \mid Y] \le H[X].$$
The fact that conditioning reduces entropy, $H[X \mid Y] \le H[X]$, is exactly step (2) above, applied conditionally on $Z$ (that is, using $I[X; Y \mid Z] \ge 0$).
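The following small sketch (again with made-up discrete distributions) computes the entropies appearing in the chain above for a Markov chain $X \to Y \to Z$ and checks that $H[X] \ge H[X \mid Z] \ge H[X \mid Y, Z] = H[X \mid Y]$:

```python
import numpy as np

rng = np.random.default_rng(1)

p_x = rng.dirichlet(np.ones(4))           # p(x)
T_yx = rng.dirichlet(np.ones(3), size=4)  # p(y|x)
T_zy = rng.dirichlet(np.ones(2), size=3)  # p(z|y)

# Joint p(x, y, z) = p(x) p(y|x) p(z|y) for the Markov chain X -> Y -> Z.
joint = p_x[:, None, None] * T_yx[:, :, None] * T_zy[None, :, :]

def H(p):
    """Entropy of a (joint) probability table, in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_X = H(joint.sum(axis=(1, 2)))
H_X_given_Y = H(joint.sum(axis=2)) - H(joint.sum(axis=(0, 2)))   # H[X,Y] - H[Y]
H_X_given_Z = H(joint.sum(axis=1)) - H(joint.sum(axis=(0, 1)))   # H[X,Z] - H[Z]
H_X_given_YZ = H(joint) - H(joint.sum(axis=0))                   # H[X,Y,Z] - H[Y,Z]

print(f"H[X]     = {H_X:.4f}")
print(f"H[X|Z]   = {H_X_given_Z:.4f}  (<= H[X])")
print(f"H[X|Y]   = {H_X_given_Y:.4f}")
print(f"H[X|Y,Z] = {H_X_given_YZ:.4f}  (== H[X|Y] up to floating-point error)")
```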
Let’s move on and consider the 🥬 data processing inequality.
🥬 Data Processing Inequality
A similar DPI can be expressed for different distributions $p(x)$ and $q(x)$ of the same random variable that are pushed through the same (potentially stochastic) transition function $f(y \mid x)$.
This can be formalized for distributions $p(x)$ and $q(x)$ with induced marginals $p(y) = \mathbb{E}_{p(x)}[f(y \mid x)]$ and $q(y) = \mathbb{E}_{q(x)}[f(y \mid x)]$:
$$D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x)).$$
Cover and Thomas describe this in their book Elements of Information Theory as “relative entropy never increases” and relate it to the second law of thermodynamics.
Example 2: Comparing Image Distributions
As another example, let $p(x)$ be the true distribution of images in a dataset, $q(x)$ be a generative model that tries to mimic $p(x)$, and $f$ be a function that thresholds images $x$ into bilevel black-and-white images $y$.
Then $D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x))$: after thresholding, the model's distribution can only become harder to distinguish from the true distribution.
This provides some intuition for why the 🥬 divergence between distributions cannot increase under a shared stochastic mapping, as formalized by the 🥬 data processing inequality. Processing both distributions through the same mapping can only blur the differences that distinguish them.
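Here is a minimal numerical sketch of this effect (the distributions are made up): two distributions over four grayscale levels are thresholded into black/white, and the 🥬 divergence can only shrink.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Two made-up distributions over four grayscale levels {0, 1, 2, 3}.
p_x = np.array([0.40, 0.30, 0.20, 0.10])   # "true" image statistics
q_x = np.array([0.25, 0.25, 0.25, 0.25])   # generative model

# Deterministic thresholding channel f(y|x): levels {0, 1} -> black, {2, 3} -> white.
f_yx = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])

p_y = p_x @ f_yx   # marginal over {black, white} under p
q_y = q_x @ f_yx   # marginal over {black, white} under q

print(f"D_KL(p(x) || q(x)) = {kl(p_x, q_x):.4f} nats")
print(f"D_KL(p(y) || q(y)) = {kl(p_y, q_y):.4f} nats  (never larger)")
```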
Counter-Example 3: Bayesian Inference
It might be tempting to think that this data processing inequality also applies to Bayesian inference (updating the model parameters based on new evidence). Then one could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference. Let's walk through why.
Let $p(\theta)$ be an agent's prior belief, $q(\theta)$ be another agent's different prior, $p(\theta \mid x)$ be the first agent's posterior after observing data $x$, and $q(\theta \mid x)$ be the other agent's posterior.
The priors $p(\theta)$ and $q(\theta)$ are mapped to the posteriors $p(\theta \mid x)$ and $q(\theta \mid x)$, but in general it does not hold that $D_{KL}(p(\theta \mid x) \,\|\, q(\theta \mid x)) \le D_{KL}(p(\theta) \,\|\, q(\theta))$.
This is because Bayesian updating is not a shared transition function of the kind the 🥬 DPI requires: instead of marginalizing over a transition function, Bayes' rule multiplies each prior by the likelihood $p(x \mid \theta)$ and renormalizes, and the two agents renormalize by different evidence terms $p(x)$ and $q(x)$.
The correct intuition is that observing the same data does not necessarily bring two agents' beliefs closer together: depending on the priors and the observed data, the 🥬 divergence between the posteriors can be larger than the 🥬 divergence between the priors, as the numerical counterexample below shows.
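A minimal sketch (all numbers made up): two priors over three hypotheses that agree on the first hypothesis and disagree on the other two. The data effectively rules out the hypothesis they agree on, and the 🥬 divergence between the posteriors ends up larger than between the priors.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    return np.sum(p * np.log(p / q))

# Two different priors over three hypotheses (made-up numbers).
prior_p = np.array([0.45, 0.45, 0.10])
prior_q = np.array([0.45, 0.10, 0.45])

# Likelihood p(x | theta) of the observed data under each hypothesis:
# the data essentially rules out the first hypothesis, on which both agents agreed.
likelihood = np.array([0.001, 1.0, 1.0])

post_p = prior_p * likelihood
post_p /= post_p.sum()
post_q = prior_q * likelihood
post_q /= post_q.sum()

print(f"D_KL(prior_p || prior_q)         = {kl(prior_p, prior_q):.3f} nats")
print(f"D_KL(posterior_p || posterior_q) = {kl(post_p, post_q):.3f} nats  (larger!)")
```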
This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors.
Proofs of the 🥬 Data Processing Inequality
We will prove this inequality in two different ways. First, we will develop a “brute-force” proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.
Brute-force Proof
If $D_{KL}(p(x) \,\|\, q(x)) = \infty$, the inequality trivially holds.
Thus, now assume that $D_{KL}(p(x) \,\|\, q(x))$ is finite. Writing the induced marginals as $p(y) = \sum_x p(x)\, f(y \mid x)$ and $q(y) = \sum_x q(x)\, f(y \mid x)$, we have
$$D_{KL}(p(y) \,\|\, q(y)) \overset{(1)}{=} \sum_y \Big(\sum_x p(x)\, f(y \mid x)\Big) \log \frac{\sum_x p(x)\, f(y \mid x)}{\sum_x q(x)\, f(y \mid x)} \overset{(2)}{\le} \sum_y \sum_x p(x)\, f(y \mid x) \log \frac{p(x)\, f(y \mid x)}{q(x)\, f(y \mid x)} = D_{KL}(p(x) \,\|\, q(x)),$$
where (1) expands the marginals via the transition function and (2) is the log-sum inequality (equivalently, Jensen's inequality applied to the convex function $t \mapsto t \log t$). The final equality holds because $f(y \mid x)$ cancels inside the logarithm and sums to one over $y$.
This makes it difficult to extract the case for equality, however.
Equality Case
For the inequality step (2), this is sadly slightly more complex than it might seem at first glance. Let's unwrap the term: the inequality is applied separately for each $y$, to
$$\Big(\sum_x p(x)\, f(y \mid x)\Big) \log \frac{\sum_x p(x)\, f(y \mid x)}{\sum_x q(x)\, f(y \mid x)} \le \sum_x p(x)\, f(y \mid x) \log \frac{p(x)\, f(y \mid x)}{q(x)\, f(y \mid x)}.$$
In the following, I will limit myself to the discrete case to avoid having to deal with measure theory. To obtain equality, for all $y$ with $p(y) > 0$, the log-sum inequality above must be tight, which is the case exactly when the ratio $\frac{p(x)\, f(y \mid x)}{q(x)\, f(y \mid x)} = \frac{p(x)}{q(x)}$ is constant across all $x$ with $f(y \mid x)\, p(x) > 0$.
If for any such $y$ the ratio $p(x)/q(x)$ takes different values on the support of $f(y \mid \cdot)$, the inequality is strict.
As an example, at the simplest, if this condition holds for all $y$ because $p(x)/q(x)$ is globally constant, then $p(x) = q(x)$ (both distributions are normalized) and both sides of the inequality are zero.
Simple & Elegant Proof
Cover and Thomas provide a beautifully simple proof: apply the chain rule of the 🥬 divergence to the joint distributions $p(x, y) = p(x)\, f(y \mid x)$ and $q(x, y) = q(x)\, f(y \mid x)$ in both directions:
$$D_{KL}(p(x, y) \,\|\, q(x, y)) = D_{KL}(p(x) \,\|\, q(x)) + \underbrace{D_{KL}(p(y \mid x) \,\|\, q(y \mid x))}_{=0 \text{ (shared } f(y \mid x))} = D_{KL}(p(y) \,\|\, q(y)) + \underbrace{D_{KL}(p(x \mid y) \,\|\, q(x \mid y))}_{\ge 0},$$
where the conditional 🥬 divergences are taken in expectation over $p$. It follows that $D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x))$, with equality if and only if $D_{KL}(p(x \mid y) \,\|\, q(x \mid y)) = 0$.
What does this mean? Whereas the forward transition functions $p(y \mid x)$ and $q(y \mid x)$ are equal by construction, equality additionally requires the backward transition functions $p(x \mid y)$ and $q(x \mid y)$ to be equal (for all $y$ with $p(y) > 0$).
The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case.
This time we explicitly use Bayes’ rule to connect the forward and backward transition functions. First, we have to fix a $y$ with $p(y) > 0$. Bayes’ rule gives
$$p(x \mid y) = \frac{f(y \mid x)\, p(x)}{p(y)} \quad \text{and} \quad q(x \mid y) = \frac{f(y \mid x)\, q(x)}{q(y)},$$
so requiring $p(x \mid y) = q(x \mid y)$ for all $x$ with $f(y \mid x) > 0$ is equivalent to requiring
$$\frac{p(x)}{q(x)} = \frac{p(y)}{q(y)} \quad \text{for all } x \text{ with } f(y \mid x) > 0:$$
the density ratio $p(x)/q(x)$ must be constant on the support of $f(y \mid \cdot)$ and equal to the ratio of the induced marginals.
As another example, if $f(y \mid x)$ is a deterministic, invertible mapping, each $y$ is reached from exactly one $x$, the condition holds trivially, and the 🥬 divergence is preserved exactly.
Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between $p(x)$ and $q(x)$ and between the induced marginals $p(y)$ and $q(y)$.
Altogether, we have seen that both equality conditions agree: equality holds exactly when the density ratio is constant wherever the transition function can produce the same $y$, which is the same as the backward transition functions $p(x \mid y)$ and $q(x \mid y)$ coinciding.
Overall Statement
We have the following overall statement: for distributions $p(x)$ and $q(x)$ and a transition function $f(y \mid x)$ with induced marginals $p(y) = \mathbb{E}_{p(x)}[f(y \mid x)]$ and $q(y) = \mathbb{E}_{q(x)}[f(y \mid x)]$,
$$D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x)).$$
(Equality holds if and only if the backward transition functions agree.)
More precisely, for the discrete case, equality holds if and only if $p(x \mid y) = q(x \mid y)$ for all $x$ and all $y$ with $p(y) > 0$.
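To illustrate the equality case, here is a small sketch (made-up numbers) in which the channel only merges states on which the density ratio $p(x)/q(x)$ is constant, so the 🥬 divergence is preserved exactly:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    return np.sum(p * np.log(p / q))

p_x = np.array([0.2, 0.2, 0.6])
q_x = np.array([0.1, 0.1, 0.8])   # ratios p(x)/q(x) = (2, 2, 0.75)

# Channel that merges states 0 and 1 (where the ratio p/q is constant) into y=0,
# and maps state 2 to y=1.
f_yx = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0]])

p_y = p_x @ f_yx
q_y = q_x @ f_yx

print(f"D_KL(p(x) || q(x)) = {kl(p_x, q_x):.6f} nats")
print(f"D_KL(p(y) || q(y)) = {kl(p_y, q_y):.6f} nats  (equal: merged states share the same ratio)")
```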
Revisiting Data Processing Inequalities
Now, we can use this to derive a few additional results and also to close the circle to the original data processing inequality.
Jensen-Shannon Divergence
The KL (🥬) divergence is not a metric. The triangle inequality does not hold, and it is not symmetric.
We can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two 🥬 divergences of the two distributions from their average $m = \frac{1}{2}(p + q)$:
$$\mathrm{JSD}(p \,\|\, q) := \frac{1}{2} D_{KL}(p \,\|\, m) + \frac{1}{2} D_{KL}(q \,\|\, m).$$
It essentially makes the 🥬 divergence symmetric.
The square root of the Jensen-Shannon divergence is symmetric and satisfies the triangle inequality—it is a metric: the Jensen-Shannon distance.
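A quick sketch (with arbitrary example distributions) of the JSD and the Jensen-Shannon distance, checking symmetry and the triangle inequality on a random triple:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence, in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the JSD."""
    return np.sqrt(jsd(p, q))

rng = np.random.default_rng(2)
p, q, r = rng.dirichlet(np.ones(5), size=3)

print(f"JSD(p||q) = {jsd(p, q):.4f},  JSD(q||p) = {jsd(q, p):.4f}  (symmetric)")
print(f"triangle inequality: {js_distance(p, r):.4f} <= "
      f"{js_distance(p, q) + js_distance(q, r):.4f}")
```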
The JSD Data Processing Inequality
We can also obtain a data processing inequality for the Jensen-Shannon divergence and, by taking square roots, for the Jensen-Shannon distance:
$$\mathrm{JSD}(p(y) \,\|\, q(y)) \le \mathrm{JSD}(p(x) \,\|\, q(x)).$$
The proof uses the 🥬 data processing inequality: the mixture $m(x) = \frac{1}{2}(p(x) + q(x))$ is mapped by the transition function $f(y \mid x)$ to the mixture $m(y) = \frac{1}{2}(p(y) + q(y))$ because marginalization is linear. Applying the 🥬 DPI to each of the two terms gives $D_{KL}(p(y) \,\|\, m(y)) \le D_{KL}(p(x) \,\|\, m(x))$ and $D_{KL}(q(y) \,\|\, m(y)) \le D_{KL}(q(x) \,\|\, m(x))$, and averaging the two inequalities yields the claim.
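The sketch below (made-up distributions and channel) checks this numerically:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(3)
p_x, q_x = rng.dirichlet(np.ones(6), size=2)   # two made-up distributions over x
f_yx = rng.dirichlet(np.ones(3), size=6)       # shared channel p(y|x), rows sum to 1

p_y, q_y = p_x @ f_yx, q_x @ f_yx

print(f"JSD(p(x) || q(x)) = {jsd(p_x, q_x):.4f} nats")
print(f"JSD(p(y) || q(y)) = {jsd(p_y, q_y):.4f} nats  (never larger)")
```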
Mutual Information
The JSD can also be expressed as a mutual information. For a uniform Bernoulli random variable $Z$ that selects $X \sim p$ when $Z = 0$ and $X \sim q$ when $Z = 1$, we have
$$\mathrm{JSD}(p \,\|\, q) = I[X; Z].$$
We can generalize this to the Markov chain $Z \to X \to Y$: using $I[X; Z] = \mathbb{E}_{p(z)}[D_{KL}(p(x \mid z) \,\|\, p(x))]$ and applying the 🥬 data processing inequality to $p(x \mid z)$ and $p(x)$ for each $z$ (both are pushed through the same transition function $p(y \mid x)$, by the Markov property), we obtain $D_{KL}(p(y \mid z) \,\|\, p(y)) \le D_{KL}(p(x \mid z) \,\|\, p(x))$ for each $z$; taking the expectation over $p(z)$ yields $I[Y; Z] \le I[X; Z]$.
This is just the data processing inequality we presented initially (with the variables relabeled).
The equality gap (Jensen gap) is $I[X; Z \mid Y]$: we have equality exactly when $X$ and $Z$ are conditionally independent given $Y$.
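A final sketch (made-up distributions) verifies the identity $\mathrm{JSD}(p \,\|\, q) = I[X; Z]$ for a uniform Bernoulli selector $Z$:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mutual_information(joint):
    """I[A; B] from a joint probability table p(a, b), in nats."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    return np.sum(joint * np.log(joint / (pa @ pb)))

rng = np.random.default_rng(4)
p, q = rng.dirichlet(np.ones(5), size=2)

# Joint distribution of (Z, X): Z ~ Bernoulli(1/2) selects whether X ~ p or X ~ q.
joint_zx = np.stack([0.5 * p, 0.5 * q])

print(f"JSD(p || q) = {jsd(p, q):.6f} nats")
print(f"I[X; Z]     = {mutual_information(joint_zx):.6f} nats  (identical)")
```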
Summary. We have shown that we can derive the data processing inequality using the 🥬 data processing inequality.
Conclusion
In this blog post, we explored the data processing inequality and its applications in information theory. Specifically, we discussed both the original data processing inequality and the 🥬 data processing inequality. We provided derivations and explanations for these inequalities, emphasizing their significance and limitations.
While the data processing inequality provides valuable insights into information dynamics in various scenarios, it is crucial to consider the specific assumptions and conditions under which it holds. The examples and counterexample hopefully demonstrate the nuances of applying these inequalities in different contexts.
By understanding the foundations and limitations of the data processing inequality, we can leverage its insights effectively in information theory and related fields.
Acknowledgements. Thanks to Linara Adilova for valuable feedback again.