In information theory, the data processing inequality (DPI) is a powerful concept. Informally, it tells us that processing data cannot increase the amount of information it contains. In this two-part blog post, we will explore the DPI and its applications to function-space variational inference (FSVI).
In the first part of this blog post, we will provide intuitive explanations and present mathematical proofs of the DPI. Then, in the second part, we will explore the application of the data processing inequality to function-space variational inference and its relationship to variational inference in general.
The goal of this post is to look at the data processing inequality from different angles to better understand it. We will also consider the equality case (which is arguably the best way to understand inequalities).
Background: Information-Theoretic Notation
This blog post uses the same background as the previous blog post “Bayesian Appropriation: General Likelihood for Loss Functions”, so you can skip this section if you have already read that post.
Information theory deals with the communication of information. In this blog post, we use information-theoretic notation to express various quantities related to probability distributions and their relationships. Here are some key concepts we will use:
The information content of an event $x$ with probability $p(x)$ is $-\log p(x)$: the less likely an event, the more information it carries.

The entropy
$$H[X] := \mathbb{E}_{p(x)}[-\log p(x)]$$
is the expected information content. The entropy measures the average amount of information needed to describe the random variable $X$.

We will also use the Kullback-Leibler divergence
(🥬)
$$D_{KL}(p \,\|\, q) := \mathbb{E}_{p(x)}\left[\log \frac{p(x)}{q(x)}\right]$$
and the cross-entropy
$$H(p, q) := \mathbb{E}_{p(x)}[-\log q(x)] = H(p) + D_{KL}(p \,\|\, q).$$
The cross-entropy quantifies the average number of bits needed to encode samples drawn from the true distribution $p$ when using a code that is optimal for $q$.
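To make these definitions concrete, here is a minimal sketch in Python/NumPy (the distributions are made up for illustration) that computes the information content, entropy, cross-entropy, and 🥬 divergence for small discrete distributions. It uses natural logarithms, so the results are in nats rather than bits.

```python
import numpy as np

def entropy(p):
    """Entropy H[X] = E_p[-log p(x)], in nats (assumes p > 0 everywhere)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = E_p[-log q(x)], in nats."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """🥬 divergence D_KL(p || q) = cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.25, 0.25])   # "true" distribution (made up)
q = np.array([0.4, 0.4, 0.2])     # model distribution (made up)

print(f"information content of an event with p=0.25: {-np.log(0.25):.3f} nats")
print(f"H(p)       = {entropy(p):.3f} nats")
print(f"H(p, q)    = {cross_entropy(p, q):.3f} nats")
print(f"D_KL(p||q) = {kl_divergence(p, q):.3f} nats")
```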
Data Processing Inequality
The data processing inequality (DPI) is a fundamental inequality in information theory that states that the mutual information between two random variables cannot increase through processing. The original DPI is typically stated for a Markov chain of random variables $X \to Y \to Z$ and says:
$$I[X; Y] \ge I[X; Z].$$
We can view $Z$ as the result of (potentially stochastically) processing $Y$: no matter how we process $Y$, we cannot gain additional information about $X$.
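As a quick numerical sanity check, here is a small sketch (with made-up distributions and channels) that builds a discrete Markov chain $X \to Y \to Z$ and verifies $I[X; Y] \ge I[X; Z]$:

```python
import numpy as np

def mutual_information(joint):
    """I[A; B] from a joint probability table p(a, b), in nats."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask]))

rng = np.random.default_rng(0)

p_x = rng.dirichlet(np.ones(4))               # p(x), made up
T_yx = rng.dirichlet(np.ones(3), size=4)      # channel p(y|x), rows sum to 1
T_zy = rng.dirichlet(np.ones(2), size=3)      # channel p(z|y), rows sum to 1

joint_xy = p_x[:, None] * T_yx                # p(x, y)
joint_xz = p_x[:, None] * (T_yx @ T_zy)       # p(x, z) via the Markov chain X -> Y -> Z

print(f"I[X;Y] = {mutual_information(joint_xy):.4f} nats")
print(f"I[X;Z] = {mutual_information(joint_xz):.4f} nats  (never larger)")
```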
Example 1
The following three scenarios illustrate the data processing inequality using different mappings.
Image Processing Pipeline
Consider an image processing pipeline with the following steps. Let $X$ be the original image data, $Y$ be a compressed version of the image, and $Z$ be $Y$ after adding blur and pixelation.
In this case, $X \to Y \to Z$ forms a Markov chain, and the DPI tells us that $I[X; Y] \ge I[X; Z]$: blurring and pixelating the compressed image can only destroy information about the original image, never recover or add to it.
Supervised Learning
Consider a supervised learning pipeline with the following steps. Let $X$ be the input features, $Y$ be the intermediate representations learned by the model, and $Z$ be the model predictions.
Here, $I[X; Y] \ge I[X; Z]$.
This makes intuitive sense: the intermediate representations $Y$ are computed from the inputs $X$, and the predictions $Z$ are computed only from $Y$, so the predictions cannot carry more information about the inputs than the representations they are derived from.
As a more concrete example, consider an image classification model. Let $X$ be the input images, $Y$ be the activations of the convolutional layers, and $Z$ be the predicted image labels.
The convolutional layers will extract features from the input images, but they cannot extract more information than is present in the original images. The predicted labels are obtained by further processing these convolutional features, and so may lose some fine-grained information about the original inputs.
Autoencoders
An autoencoder compresses the input into a lower-dimensional latent code and then tries to reconstruct the input from it. Let $X$ be the input, $Y$ be the latent code, and $Z$ be the reconstruction.
The data processing inequality tells us again: $I[X; Y] \ge I[X; Z]$.
The latent code $Y$ cannot contain more information about the input than the input itself, and the reconstruction $Z$ cannot recover information that was already lost in the latent code.
Intuitively, autoencoders try to preserve as much mutual information between the inputs $X$ and the reconstructions $Z$ as possible within the bottleneck imposed by the latent code $Y$.
Proof of the Data Processing Inequality
The proof is simple and connects the DPI to another important inequality.
First, we note that the Markov chain implies the following factorization of the joint distribution:
$$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y).$$
This relies on $Z$ being conditionally independent of $X$ given $Y$: $p(z \mid x, y) = p(z \mid y)$.
We have the following chain of inequalities:
$$I[X; Z] = H[X] - H[X \mid Z] \overset{(2)}{\le} H[X] - H[X \mid Y, Z] = H[X] - H[X \mid Y] = I[X; Y],$$
where the last-but-one equality uses the Markov property $p(x \mid y, z) = p(x \mid y)$, and the inequality (2) is the statement that conditioning reduces entropy, which we prove below.
The equality gap is $H[X \mid Z] - H[X \mid Y, Z] = I[X; Y \mid Z]$: we have equality if and only if $X$ and $Y$ are conditionally independent given $Z$, that is, if $X \to Z \to Y$ is also a Markov chain.
Proof of (2) “Conditioning Reduces Entropy”:
We can easily show that conditioning reduces entropy by using the non-negativity of the mutual information:
$$H[X] - H[X \mid Y] = I[X; Y] \ge 0 \implies H[X \mid Y] \le H[X].$$
The fact that conditioning reduces entropy, $H[X \mid Y] \le H[X]$, is exactly step (2) above, applied conditionally on $Z$ (that is, using $I[X; Y \mid Z] \ge 0$).
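The following small sketch (again with made-up discrete distributions) computes the entropies appearing in the chain above for a Markov chain $X \to Y \to Z$ and checks that $H[X] \ge H[X \mid Z] \ge H[X \mid Y, Z] = H[X \mid Y]$:

```python
import numpy as np

rng = np.random.default_rng(1)

p_x = rng.dirichlet(np.ones(4))           # p(x)
T_yx = rng.dirichlet(np.ones(3), size=4)  # p(y|x)
T_zy = rng.dirichlet(np.ones(2), size=3)  # p(z|y)

# Joint p(x, y, z) = p(x) p(y|x) p(z|y) for the Markov chain X -> Y -> Z.
joint = p_x[:, None, None] * T_yx[:, :, None] * T_zy[None, :, :]

def H(p):
    """Entropy of a (joint) probability table, in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_X = H(joint.sum(axis=(1, 2)))
H_X_given_Y = H(joint.sum(axis=2)) - H(joint.sum(axis=(0, 2)))   # H[X,Y] - H[Y]
H_X_given_Z = H(joint.sum(axis=1)) - H(joint.sum(axis=(0, 1)))   # H[X,Z] - H[Z]
H_X_given_YZ = H(joint) - H(joint.sum(axis=0))                   # H[X,Y,Z] - H[Y,Z]

print(f"H[X]     = {H_X:.4f}")
print(f"H[X|Z]   = {H_X_given_Z:.4f}  (<= H[X])")
print(f"H[X|Y]   = {H_X_given_Y:.4f}")
print(f"H[X|Y,Z] = {H_X_given_YZ:.4f}  (== H[X|Y] up to floating-point error)")
```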
Let’s move on and consider the 🥬 data processing inequality.
🥬 Data Processing Inequality
A similar DPI can be expressed for different distributions $p(x)$ and $q(x)$ of the same random variable that are pushed through the same (potentially stochastic) transition function $f(y \mid x)$.
This can be formalized for distributions $p(x)$ and $q(x)$ with induced marginals $p(y) = \mathbb{E}_{p(x)}[f(y \mid x)]$ and $q(y) = \mathbb{E}_{q(x)}[f(y \mid x)]$:
$$D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x)).$$
Cover and Thomas describe this in their book Elements of Information Theory as “relative entropy never increases” and relate it to the second law of thermodynamics.
Example 2: Comparing Image Distributions
As another example, let $p(x)$ be the true distribution of images in a dataset, $q(x)$ be a generative model that tries to mimic $p(x)$, and $f$ be a function that thresholds images $x$ into bilevel black-and-white images $y$.
Then $D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x))$: after thresholding, the model's distribution can only become harder to distinguish from the true distribution.
This provides some intuition for why the 🥬 divergence between distributions cannot increase under a shared stochastic mapping, as formalized by the 🥬 data processing inequality. Processing both distributions through the same mapping can only blur the differences that distinguish them.
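Here is a minimal numerical sketch of this effect (the distributions are made up): two distributions over four grayscale levels are thresholded into black/white, and the 🥬 divergence can only shrink.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Two made-up distributions over four grayscale levels {0, 1, 2, 3}.
p_x = np.array([0.40, 0.30, 0.20, 0.10])   # "true" image statistics
q_x = np.array([0.25, 0.25, 0.25, 0.25])   # generative model

# Deterministic thresholding channel f(y|x): levels {0, 1} -> black, {2, 3} -> white.
f_yx = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])

p_y = p_x @ f_yx   # marginal over {black, white} under p
q_y = q_x @ f_yx   # marginal over {black, white} under q

print(f"D_KL(p(x) || q(x)) = {kl(p_x, q_x):.4f} nats")
print(f"D_KL(p(y) || q(y)) = {kl(p_y, q_y):.4f} nats  (never larger)")
```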
Counter-Example 3: Bayesian Inference
It might be tempting to think that this data processing inequality also applies to Bayesian inference (updating the model parameters based on new evidence). Then one could argue that if two agents start with different prior beliefs but update based on the same evidence, their posterior beliefs will become more similar. However, this intuition is flawed: the data processing inequality does not apply to Bayesian inference. Let's walk through why.
Let $p(\theta)$ be an agent's prior belief, $q(\theta)$ be another agent's different prior, $p(\theta \mid x)$ be the first agent's posterior after observing data $x$, and $q(\theta \mid x)$ be the other agent's posterior.
The priors $p(\theta)$ and $q(\theta)$ are mapped to the posteriors $p(\theta \mid x)$ and $q(\theta \mid x)$, but in general it does not hold that $D_{KL}(p(\theta \mid x) \,\|\, q(\theta \mid x)) \le D_{KL}(p(\theta) \,\|\, q(\theta))$.
This is because Bayesian updating is not a shared transition function of the kind the 🥬 DPI requires: instead of marginalizing over a transition function, Bayes' rule multiplies each prior by the likelihood $p(x \mid \theta)$ and renormalizes, and the two agents renormalize by different evidence terms $p(x)$ and $q(x)$.
The correct intuition is that observing the same data does not necessarily bring two agents' beliefs closer together: depending on the priors and the observed data, the 🥬 divergence between the posteriors can be larger than the 🥬 divergence between the priors, as the numerical counterexample below shows.
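A minimal sketch (all numbers made up): two priors over three hypotheses that agree on the first hypothesis and disagree on the other two. The data effectively rules out the hypothesis they agree on, and the 🥬 divergence between the posteriors ends up larger than between the priors.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    return np.sum(p * np.log(p / q))

# Two different priors over three hypotheses (made-up numbers).
prior_p = np.array([0.45, 0.45, 0.10])
prior_q = np.array([0.45, 0.10, 0.45])

# Likelihood p(x | theta) of the observed data under each hypothesis:
# the data essentially rules out the first hypothesis, on which both agents agreed.
likelihood = np.array([0.001, 1.0, 1.0])

post_p = prior_p * likelihood
post_p /= post_p.sum()
post_q = prior_q * likelihood
post_q /= post_q.sum()

print(f"D_KL(prior_p || prior_q)         = {kl(prior_p, prior_q):.3f} nats")
print(f"D_KL(posterior_p || posterior_q) = {kl(post_p, post_q):.3f} nats  (larger!)")
```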
This counterexample highlights the importance of precisely understanding the assumptions underlying conceptual principles like the DPI. While the DPI provides insight about information dynamics in many cases, it does not universally apply, as exemplified here by Bayesian updating under different priors.
Proofs of the 🥬 Data Processing Inequality
We will prove this inequality in two different ways. First, we will develop a “brute-force” proof, and then we will look at a more elegant proof that follows Cover and Thomas. Importantly, we will also consider the equality case in detail.
Brute-force Proof
If $D_{KL}(p(x) \,\|\, q(x)) = \infty$, the inequality trivially holds.
Thus, now assume that $D_{KL}(p(x) \,\|\, q(x))$ is finite. Writing the induced marginals as $p(y) = \sum_x p(x)\, f(y \mid x)$ and $q(y) = \sum_x q(x)\, f(y \mid x)$, we have
$$D_{KL}(p(y) \,\|\, q(y)) \overset{(1)}{=} \sum_y \Big(\sum_x p(x)\, f(y \mid x)\Big) \log \frac{\sum_x p(x)\, f(y \mid x)}{\sum_x q(x)\, f(y \mid x)} \overset{(2)}{\le} \sum_y \sum_x p(x)\, f(y \mid x) \log \frac{p(x)\, f(y \mid x)}{q(x)\, f(y \mid x)} = D_{KL}(p(x) \,\|\, q(x)),$$
where (1) expands the marginals via the transition function and (2) is the log-sum inequality (equivalently, Jensen's inequality applied to the convex function $t \mapsto t \log t$). The final equality holds because $f(y \mid x)$ cancels inside the logarithm and sums to one over $y$.
This makes it difficult to extract the case for equality, however.
Equality Case
For the inequality step (2), this is sadly slightly more complex than it might seem at first glance. Let's unwrap the term: the inequality is applied separately for each $y$, to
$$\Big(\sum_x p(x)\, f(y \mid x)\Big) \log \frac{\sum_x p(x)\, f(y \mid x)}{\sum_x q(x)\, f(y \mid x)} \le \sum_x p(x)\, f(y \mid x) \log \frac{p(x)\, f(y \mid x)}{q(x)\, f(y \mid x)}.$$
In the following, I will limit myself to the discrete case to avoid having to deal with measure theory. To obtain equality, for all $y$ with $p(y) > 0$, the log-sum inequality above must be tight, which is the case exactly when the ratio $\frac{p(x)\, f(y \mid x)}{q(x)\, f(y \mid x)} = \frac{p(x)}{q(x)}$ is constant across all $x$ with $f(y \mid x)\, p(x) > 0$.
If for any such $y$ the ratio $p(x)/q(x)$ takes different values on the support of $f(y \mid \cdot)$, the inequality is strict.
As an example, at the simplest, if this condition holds for all $y$ because $p(x)/q(x)$ is globally constant, then $p(x) = q(x)$ (both distributions are normalized) and both sides of the inequality are zero.
Simple & Elegant Proof
Cover and Thomas provide a beautifully simple proof: apply the chain rule of the 🥬 divergence to the joint distributions $p(x, y) = p(x)\, f(y \mid x)$ and $q(x, y) = q(x)\, f(y \mid x)$ in both directions:
$$D_{KL}(p(x, y) \,\|\, q(x, y)) = D_{KL}(p(x) \,\|\, q(x)) + \underbrace{D_{KL}(p(y \mid x) \,\|\, q(y \mid x))}_{=0 \text{ (shared } f(y \mid x))} = D_{KL}(p(y) \,\|\, q(y)) + \underbrace{D_{KL}(p(x \mid y) \,\|\, q(x \mid y))}_{\ge 0},$$
where the conditional 🥬 divergences are taken in expectation over $p$. It follows that $D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x))$, with equality if and only if $D_{KL}(p(x \mid y) \,\|\, q(x \mid y)) = 0$.
What does this mean? Whereas the forward transition functions $p(y \mid x)$ and $q(y \mid x)$ are equal by construction, equality additionally requires the backward transition functions $p(x \mid y)$ and $q(x \mid y)$ to be equal (for all $y$ with $p(y) > 0$).
The statement on equality is not very informative yet though, so we have to put in a bit more work. Again, this is written for the discrete case.
This time we explicitly use Bayes’ rule to connect the forward and backward transition functions. First, we have to fix a $y$ with $p(y) > 0$. Bayes’ rule gives
$$p(x \mid y) = \frac{f(y \mid x)\, p(x)}{p(y)} \quad \text{and} \quad q(x \mid y) = \frac{f(y \mid x)\, q(x)}{q(y)},$$
so requiring $p(x \mid y) = q(x \mid y)$ for all $x$ with $f(y \mid x) > 0$ is equivalent to requiring
$$\frac{p(x)}{q(x)} = \frac{p(y)}{q(y)} \quad \text{for all } x \text{ with } f(y \mid x) > 0:$$
the density ratio $p(x)/q(x)$ must be constant on the support of $f(y \mid \cdot)$ and equal to the ratio of the induced marginals.
As another example, if $f(y \mid x)$ is a deterministic, invertible mapping, each $y$ is reached from exactly one $x$, the condition holds trivially, and the 🥬 divergence is preserved exactly.
Compared to the previous equality case, we went a bit deeper and rewrote the conditions to consider the ratios between $p(x)$ and $q(x)$ and between the induced marginals $p(y)$ and $q(y)$.
Altogether, we have seen that both equality conditions agree: equality holds exactly when the density ratio is constant wherever the transition function can produce the same $y$, which is the same as the backward transition functions $p(x \mid y)$ and $q(x \mid y)$ coinciding.
Overall Statement
We have the following overall statement: for distributions $p(x)$ and $q(x)$ and a transition function $f(y \mid x)$ with induced marginals $p(y) = \mathbb{E}_{p(x)}[f(y \mid x)]$ and $q(y) = \mathbb{E}_{q(x)}[f(y \mid x)]$,
$$D_{KL}(p(y) \,\|\, q(y)) \le D_{KL}(p(x) \,\|\, q(x)).$$
(Equality holds if and only if the backward transition functions agree.)
More precisely, for the discrete case, equality holds if and only if $p(x \mid y) = q(x \mid y)$ for all $x$ and all $y$ with $p(y) > 0$.
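To illustrate the equality case, here is a small sketch (made-up numbers) in which the channel only merges states on which the density ratio $p(x)/q(x)$ is constant, so the 🥬 divergence is preserved exactly:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    return np.sum(p * np.log(p / q))

p_x = np.array([0.2, 0.2, 0.6])
q_x = np.array([0.1, 0.1, 0.8])   # ratios p(x)/q(x) = (2, 2, 0.75)

# Channel that merges states 0 and 1 (where the ratio p/q is constant) into y=0,
# and maps state 2 to y=1.
f_yx = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0]])

p_y = p_x @ f_yx
q_y = q_x @ f_yx

print(f"D_KL(p(x) || q(x)) = {kl(p_x, q_x):.6f} nats")
print(f"D_KL(p(y) || q(y)) = {kl(p_y, q_y):.6f} nats  (equal: merged states share the same ratio)")
```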
Revisiting Data Processing Inequalities
Now, we can use this to derive a few additional results and also to close the circle to the original data processing inequality.
Jensen-Shannon Divergence
The KL (🥬) divergence is not a metric. The triangle inequality does not hold, and it is not symmetric.
We can symmetrize it to obtain the Jensen-Shannon divergence (JSD). The JSD is defined as the mean of the two 🥬 divergences of the two distributions from their average $m = \frac{1}{2}(p + q)$:
$$\mathrm{JSD}(p \,\|\, q) := \frac{1}{2} D_{KL}(p \,\|\, m) + \frac{1}{2} D_{KL}(q \,\|\, m).$$
It essentially makes the 🥬 divergence symmetric.
The square root of the Jensen-Shannon divergence is symmetric and satisfies the triangle inequality—it is a metric: the Jensen-Shannon distance.
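A quick sketch (with arbitrary example distributions) of the JSD and the Jensen-Shannon distance, checking symmetry and the triangle inequality on a random triple:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """Jensen-Shannon divergence, in nats."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the JSD."""
    return np.sqrt(jsd(p, q))

rng = np.random.default_rng(2)
p, q, r = rng.dirichlet(np.ones(5), size=3)

print(f"JSD(p||q) = {jsd(p, q):.4f},  JSD(q||p) = {jsd(q, p):.4f}  (symmetric)")
print(f"triangle inequality: {js_distance(p, r):.4f} <= "
      f"{js_distance(p, q) + js_distance(q, r):.4f}")
```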
The JSD Data Processing Inequality
We can also obtain a data processing inequality for the Jensen-Shannon divergence and, by taking square roots, for the Jensen-Shannon distance:
$$\mathrm{JSD}(p(y) \,\|\, q(y)) \le \mathrm{JSD}(p(x) \,\|\, q(x)).$$
The proof uses the 🥬 data processing inequality: the mixture $m(x) = \frac{1}{2}(p(x) + q(x))$ is mapped by the transition function $f(y \mid x)$ to the mixture $m(y) = \frac{1}{2}(p(y) + q(y))$ because marginalization is linear. Applying the 🥬 DPI to each of the two terms gives $D_{KL}(p(y) \,\|\, m(y)) \le D_{KL}(p(x) \,\|\, m(x))$ and $D_{KL}(q(y) \,\|\, m(y)) \le D_{KL}(q(x) \,\|\, m(x))$, and averaging the two inequalities yields the claim.
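The sketch below (made-up distributions and channel) checks this numerically:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(3)
p_x, q_x = rng.dirichlet(np.ones(6), size=2)   # two made-up distributions over x
f_yx = rng.dirichlet(np.ones(3), size=6)       # shared channel p(y|x), rows sum to 1

p_y, q_y = p_x @ f_yx, q_x @ f_yx

print(f"JSD(p(x) || q(x)) = {jsd(p_x, q_x):.4f} nats")
print(f"JSD(p(y) || q(y)) = {jsd(p_y, q_y):.4f} nats  (never larger)")
```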
Mutual Information
The JSD can also be expressed as a mutual information. For a uniform Bernoulli random variable $Z$ that selects $X \sim p$ when $Z = 0$ and $X \sim q$ when $Z = 1$, we have
$$\mathrm{JSD}(p \,\|\, q) = I[X; Z].$$
We can generalize this to the Markov chain $Z \to X \to Y$: using $I[X; Z] = \mathbb{E}_{p(z)}[D_{KL}(p(x \mid z) \,\|\, p(x))]$ and applying the 🥬 data processing inequality to $p(x \mid z)$ and $p(x)$ for each $z$ (both are pushed through the same transition function $p(y \mid x)$, by the Markov property), we obtain $D_{KL}(p(y \mid z) \,\|\, p(y)) \le D_{KL}(p(x \mid z) \,\|\, p(x))$ for each $z$; taking the expectation over $p(z)$ yields $I[Y; Z] \le I[X; Z]$.
This is just the data processing inequality we presented initially (with the variables relabeled).
The equality gap (Jensen gap) is $I[X; Z \mid Y]$: we have equality exactly when $X$ and $Z$ are conditionally independent given $Y$.
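A final sketch (made-up distributions) verifies the identity $\mathrm{JSD}(p \,\|\, q) = I[X; Z]$ for a uniform Bernoulli selector $Z$:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mutual_information(joint):
    """I[A; B] from a joint probability table p(a, b), in nats."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    return np.sum(joint * np.log(joint / (pa @ pb)))

rng = np.random.default_rng(4)
p, q = rng.dirichlet(np.ones(5), size=2)

# Joint distribution of (Z, X): Z ~ Bernoulli(1/2) selects whether X ~ p or X ~ q.
joint_zx = np.stack([0.5 * p, 0.5 * q])

print(f"JSD(p || q) = {jsd(p, q):.6f} nats")
print(f"I[X; Z]     = {mutual_information(joint_zx):.6f} nats  (identical)")
```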
Summary. We have shown that we can derive the data processing inequality using the 🥬 data processing inequality.
Conclusion
In this blog post, we explored the data processing inequality and its applications in information theory. Specifically, we discussed both the original data processing inequality and the 🥬 data processing inequality. We provided derivations and explanations for these inequalities, emphasizing their significance and limitations.
While the data processing inequality provides valuable insights into information dynamics in various scenarios, it is crucial to consider the specific assumptions and conditions under which it holds. The examples and counterexample hopefully demonstrate the nuances of applying these inequalities in different contexts.
By understanding the foundations and limitations of the data processing inequality, we can leverage its insights effectively in information theory and related fields.
Acknowledgements. Thanks to Linara Adilova for valuable feedback again.