The definition of the total variation distance can be confusing (at least to me) as it is formulated as a supremum. There is a simpler formulation. We connect the two here and provide some intuitions.

The reason for this post is that I was recently looking at some more theoretical machine learning papers, and the total variation distance came up several times in connection with the Kullback-Leibler divergence (Pinsker’s inequality in particular).

$$ \require{mathtools} $$

The total variation distance \(\delta(P, Q)\) of two probability measures \(P\) and \(Q\) (using Wikipedia’s notation here) is: \[ \delta(P, Q) \coloneqq \sup_{A \in \mathcal{F}} \lvert P(A) - Q(A) \rvert,\] where \(P\) and \(Q\) are defined on a \(\sigma\)-algebra \(\mathcal{F}\) of the sample space \(\Omega\) (\(\mathcal{F}\) contains subsets of \(\Omega\) that we can “measure”.)

The idea expressed here is that we want to find the subset with the largest difference between the two measures.

What does this mean when we look at the densities \(dP(\omega)\) and \(dQ(\omega)\)?

Let’s assume that \(\lvert dP(\omega) - dQ(\omega) \rvert\) is well-defined and that we can integrate over it (my measure theory is rusty, so I will need to come back to that separately). If \(\Omega\) is a finite or countable set, this is true. Similarly, it holds when the densities \(\tfrac{dP}{d\omega}\) and \(\tfrac{dQ}{d\omega}\) of \(P\) and \(Q\) are continuous 1. Then, we can write:

\[ \delta(P, Q) = \frac{1}{2} \int_\Omega \left \lvert \frac{dP}{d\omega}(\omega) - \frac{dQ}{d\omega}(\omega) \right \rvert d\omega. \]

For a categorical distribution, this is simply: \(\delta(P, Q) = \tfrac{1}{2} \sum_{\omega \in \Omega} \lvert P(\omega) - Q(\omega) \rvert\).
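As a quick sanity check, here is a minimal numerical sketch of the half-sum formula for categorical distributions (the probability vectors are made up for illustration):

```python
import numpy as np

# Two made-up categorical distributions over a four-element sample space.
P = np.array([0.5, 0.3, 0.1, 0.1])
Q = np.array([0.2, 0.4, 0.3, 0.1])

# Total variation distance via the half-sum formula.
tv = 0.5 * np.abs(P - Q).sum()
print(tv)  # ≈ 0.3
```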

This is just the absolute difference between the densities accumulated over the sample space \(\Omega\) with a magical constant \(\tfrac{1}{2}\).

What’s the intuition behind this formula?

Fig 1. Intuition behind the formula.

First, we realize that for any \(A \in \mathcal{F}\), \(\lvert P(A) - Q(A) \rvert = \lvert P(\bar A) - Q(\bar A) \rvert\), where \(\bar A\) denotes the complement of \(A\): \[ P(\bar A) - Q(\bar A) = (1 - P(A)) - (1 - Q(A)) = Q(A) - P(A).\] But this is not sufficient yet.

Let’s assume that the supremum \(\sup_{A \in \mathcal{F}} \lvert P(A) - Q(A) \rvert\) is attained by an \(A^* \in \mathcal{F}\). Without loss of generality, we also assume that \(P(A^*) \ge Q(A^*)\), so \(\lvert P(A^*) - Q(A^*) \rvert = P(A^*) - Q(A^*)\). Otherwise, we could simply swap \(P\) and \(Q\).

Finite or countable \(\Omega\)

For simplicity, let’s also assume \(\Omega\) is a finite or countable set. What can we say about \(\omega \in A^*\) then?

If there were any \(\omega' \in A^*\) with \(P(\omega') < Q(\omega')\), we could remove it from \(A^*\) and obtain a better \(A^{**} \coloneqq A^* \setminus \{\omega'\}\): \[ P(A^{**}) - Q(A^{**}) = P(A^*) - Q(A^*) - (\underbrace{P(\omega') - Q(\omega')}_{< 0}) > P(A^*) - Q(A^*).\] Thus, for all \(\omega \in A^*\), we have \(P(\omega) \ge Q(\omega)\), that is, \(P(\omega) - Q(\omega) \ge 0\): the difference has the same sign everywhere in \(A^*\).

This allows us to rewrite \[ \lvert P({A^*}) - Q({A^*}) \rvert = \Bigl\lvert \sum_{\omega \in {A^*}} (P(\omega) - Q(\omega)) \Bigr\rvert = \sum_{\omega \in {A^*}} \lvert P(\omega) - Q(\omega) \rvert,\] and finally use \(\lvert P({A^*}) - Q({A^*}) \rvert = \lvert P(\bar {A^*}) - Q(\bar {A^*}) \rvert\): \[ \begin{aligned} 2 \lvert P({A^*}) - Q({A^*}) \rvert &= \sum_{\omega \in {A^*}} \lvert P(\omega) - Q(\omega) \rvert + \sum_{\omega \in \bar {A^*}} \lvert P(\omega) - Q(\omega) \rvert \\ &= \sum_{\omega \in \Omega} \lvert P(\omega) - Q(\omega) \rvert. \end{aligned} \] This is the formula we are interested in.
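For a small finite \(\Omega\) we can verify the supremum formulation directly by enumerating every event, a brute-force sketch with made-up probability vectors:

```python
from itertools import combinations

import numpy as np

# Two made-up categorical distributions over a four-element sample space.
P = np.array([0.5, 0.3, 0.1, 0.1])
Q = np.array([0.2, 0.4, 0.3, 0.1])

# Enumerate every event A ⊆ Ω and track the largest |P(A) - Q(A)|.
sup_gap = max(
    abs(P[list(A)].sum() - Q[list(A)].sum())
    for r in range(len(P) + 1)
    for A in combinations(range(len(P)), r)
)

half_sum = 0.5 * np.abs(P - Q).sum()
print(sup_gap, half_sum)  # both ≈ 0.3
```

The supremum over all \(2^{\lvert \Omega \rvert}\) events matches the half-sum, as the argument above predicts.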

Continuous probability densities \(\tfrac{dP}{d\omega}\) and \(\tfrac{dQ}{d\omega}\)

Above, in essence, we picked as \(A^*\) exactly the set of all \(\omega\) for which \(P(\omega) - Q(\omega)\) has the same sign, e.g. \(\ge 0\). Which sign we pick does not matter because, as we have seen, we could just as well take the complement \(\bar{A^*}\).

We can use the same argument in a more general setting. For continuous densities \(\tfrac{dP}{d\omega}\) and \(\tfrac{dQ}{d\omega}\) (the Radon–Nikodym derivatives), we can argue that the sign of \(\tfrac{dP}{d\omega} - \tfrac{dQ}{d\omega}\) is the same on average over every measurable subset of \(A^*\), and then, by continuity, pointwise for all \(\omega' \in A^*\):

For any open subset \(S \subseteq A^*\), we have: \[ \int_S \left( \frac{dP}{d\omega}(\omega) - \frac{dQ}{d\omega}(\omega) \right) d\omega \ge 0. \] Otherwise, we could construct a better \(A^{**}\) by removing that subset from \(A^*\). Hence, we also have: \[ \frac{1}{\omega(S)} \int_S \left( \frac{dP}{d\omega}(\omega) - \frac{dQ}{d\omega}(\omega) \right) d\omega \ge 0, \] where we denote the underlying base measure by \(\omega(\cdot)\). As we let \(S \to \{\omega'\}\) 2, we obtain \[ \frac{dP}{d\omega}(\omega') - \frac{dQ}{d\omega}(\omega') \ge 0 \] due to continuity and the fundamental theorem of calculus (or, more generally, the Lebesgue differentiation theorem). This yields a statement similar to the finite case.

We choose \(A^*\) simply to be the set where \(\tfrac{dP}{d\omega}(\omega) - \tfrac{dQ}{d\omega}(\omega)\) is nonnegative: \[A^* \coloneqq \left \lbrace \omega \in \Omega: \frac{dP}{d\omega}(\omega) -\frac{dQ}{d\omega}(\omega) \ge 0 \right \rbrace.\]

Again, we have: \[ \begin{aligned} 2 \lvert P(A^*) - Q(A^*) \rvert &= \int_\Omega \left \lvert \frac{dP}{d\omega}(\omega) -\frac{dQ}{d\omega}(\omega) \right \rvert d\omega \\ &= \int_\Omega \left \lvert dP(\omega) - dQ(\omega) \right \rvert. \end{aligned} \] The last variant is expressed using the measures directly.
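To illustrate the continuous case numerically, here is a sketch with two arbitrarily chosen Gaussians, \(N(0,1)\) and \(N(1,1)\). It compares the half-integral of \(\lvert \tfrac{dP}{d\omega} - \tfrac{dQ}{d\omega} \rvert\) against \(P(A^*) - Q(A^*)\) for \(A^* = \lbrace \omega : \tfrac{dP}{d\omega}(\omega) \ge \tfrac{dQ}{d\omega}(\omega) \rbrace\):

```python
import math

import numpy as np

# Densities of N(0, 1) and N(1, 1); they cross exactly once, at x = 1/2.
def p(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q(x):
    return np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# (1) Half the integral of |p - q|, approximated on a fine grid.
xs = np.linspace(-10.0, 11.0, 200_001)
dx = xs[1] - xs[0]
tv_integral = 0.5 * np.sum(np.abs(p(xs) - q(xs))) * dx

# (2) P(A*) - Q(A*) with A* = (-inf, 1/2], the region where p >= q.
tv_event = Phi(0.5) - Phi(0.5 - 1.0)

print(tv_integral, tv_event)  # both ≈ 0.3829
```

Both routes give the same value, confirming that the event \(A^*\) from the derivation above attains the supremum.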


  1. we sketch a proof of this below using measure theory↩︎

  2. more precisely, over sets shrinking to \(\omega'\) with bounded eccentricity↩︎