This note explores a seemingly simple yet surprisingly profound example of how rational agents can diverge in their beliefs even when exposed to identical evidence. While the mathematical model we’ll examine is highly simplified, its core mechanism offers a potential lens through which to understand the complex dynamics of real-world phenomena like belief polarization, even in political discourse.

After reading this post, you’ll gain a new understanding of why simply presenting more facts doesn’t always lead to agreement and can initially even worsen disagreements—even for absolutely rational actors. We’ll see how this can happen through a concrete, albeit simplified, mathematical example. While we stay firmly in the realm of Bayesian machine learning, this insight might help explain current events and polarized discussions, without needing to invoke irrationality or bad faith. The mathematical framework we’ll explore reveals how divergence can emerge naturally from the process of rational belief updating itself. It invites us to consider how these simplified dynamics might play a role, alongside other factors, in the more complex world of human beliefs.

We often assume that being exposed to more evidence should bring beliefs closer together. Intuitively, if two people start with different opinions but are then shown the same data, their views should converge. It seems logical, right?

This intuition is so strong that it seems baked into fundamental principles of information theory—specifically, the data processing inequality would seem to suggest that the distance between beliefs (measured by KL divergence) should shrink when we update them with the same data.

But sometimes, the opposite happens. Sometimes, evidence can actually increase the divergence between beliefs, pushing people further apart. Does this mean that people are irrational?

Let’s explore this surprising phenomenon by constructing a simple yet powerful example using Bayesian statistics. Even for rational actors who update their beliefs according to Bayes’ rule, whose priors share support, and whose models are well-specified, more evidence can increase the divergence between their beliefs.

$$\require{mathtools} \DeclareMathOperator{\opExpectation}{\mathbb{E}} \newcommand{\E}[2]{\opExpectation_{#1} \left [ #2 \right ]} \newcommand{\simpleE}[1]{\opExpectation_{#1}} \newcommand{\implicitE}[1]{\opExpectation \left [ #1 \right ]} \DeclareMathOperator{\opVar}{\mathrm{Var}} \newcommand{\Var}[2]{\opVar_{#1} \left [ #2 \right ]} \newcommand{\implicitVar}[1]{\opVar \left [ #1 \right ]} \newcommand\MidSymbol[1][]{% \:#1\:} \newcommand{\given}{\MidSymbol[\vert]} \DeclareMathOperator{\opmus}{\mu^*} \newcommand{\IMof}[1]{\opmus[#1]} \DeclareMathOperator{\opInformationContent}{H} \newcommand{\ICof}[1]{\opInformationContent[#1]} \newcommand{\xICof}[1]{\opInformationContent(#1)} \newcommand{\sicof}[1]{h(#1)} \DeclareMathOperator{\opEntropy}{H} \newcommand{\Hof}[1]{\opEntropy[#1]} \newcommand{\xHof}[1]{\opEntropy(#1)} \DeclareMathOperator{\opMI}{I} \newcommand{\MIof}[1]{\opMI[#1]} \DeclareMathOperator{\opTC}{TC} \newcommand{\TCof}[1]{\opTC[#1]} \newcommand{\CrossEntropy}[2]{\opEntropy(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opKale}{D_\mathrm{KL}} \newcommand{\Kale}[2]{\opKale(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opJSD}{D_\mathrm{JSD}} \newcommand{\JSD}[2]{\opJSD(#1 \MidSymbol[\Vert] #2)} \DeclareMathOperator{\opp}{p} \newcommand{\pof}[1]{\opp(#1)} \newcommand{\pcof}[2]{\opp_{#1}(#2)} \newcommand{\hpcof}[2]{\hat\opp_{#1}(#2)} \DeclareMathOperator{\opq}{q} \newcommand{\qof}[1]{\opq(#1)} \newcommand{\qcof}[2]{\opq_{#1}(#2)} \newcommand{\varHof}[2]{\opEntropy_{#1}[#2]} \newcommand{\xvarHof}[2]{\opEntropy_{#1}(#2)} \newcommand{\varMIof}[2]{\opMI_{#1}[#2]} \DeclareMathOperator{\opf}{f} \newcommand{\fof}[1]{\opf(#1)} \newcommand{\indep}{\perp\!\!\!\!\perp} \newcommand{\Y}{Y} \newcommand{\y}{y} \newcommand{\X}{\boldsymbol{X}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\w}{\boldsymbol{\theta}} \newcommand{\W}{\boldsymbol{\Theta}} \newcommand{\wstar}{\boldsymbol{\theta^*}} \newcommand{\D}{\mathcal{D}} \newcommand{\HofHessian}[1]{\opEntropy''[#1]} 
\newcommand{\specialHofHessian}[2]{\opEntropy''_{#1}[#2]} \newcommand{\HofJacobian}[1]{\opEntropy'[#1]} \newcommand{\specialHofJacobian}[2]{\opEntropy'_{#1}[#2]} \newcommand{\indicator}[1]{\mathbb{1}\left[#1\right]} $$

Bayesian Updating: A Quick Refresher

Bayesian updating is a way to revise our beliefs in light of new evidence. It works like this:

  1. Start with a prior belief \(\pof{\w}\) about something.

  2. Gather evidence \(\x\) related to that belief.

  3. Update the prior belief based on the evidence and its likelihood \(\pof{\x \given \w}\) to get a posterior belief \(\pof{\w \given \x}\) by conditioning on the evidence.

The last step makes use of Bayes’ rule:

\[ \pof{\w \given \x} = \frac{\pof{\x \given \w} \pof{\w}}{\pof{\x}}. \]

We can think of it like weather forecasting. You might start with a prior belief about the chance of rain tomorrow (maybe based on the season). Then, you look at the latest weather report (the evidence). Finally, you update your belief about the chance of rain based on that report, forming your posterior belief.
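
To make the three steps concrete, here is a minimal sketch of the rain example with just two hypotheses (rain or no rain). All probabilities are made-up numbers chosen for illustration, not taken from any real forecast.

```python
# Hypothetical numbers for illustration only.
prior_rain = 0.3            # step 1: prior belief that it rains tomorrow
p_report_given_rain = 0.9   # likelihood of a "rain" report if it will rain
p_report_given_dry = 0.2    # likelihood of a "rain" report if it stays dry

# Evidence term p(x): total probability of seeing the "rain" report.
p_report = p_report_given_rain * prior_rain + p_report_given_dry * (1 - prior_rain)

# Step 3, Bayes' rule: posterior = likelihood * prior / evidence.
posterior_rain = p_report_given_rain * prior_rain / p_report
print(f"P(rain | report) = {posterior_rain:.3f}")
```

With these assumed numbers, the report raises the belief in rain from 0.30 to roughly 0.66.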

To measure how “different” two belief distributions \(\pof{\w}\) and \(\qof{\w}\) are, we can use something called the Kullback-Leibler (KL) divergence \(\Kale{\pof{\w}}{\qof{\w}}.\) You can think of it as the “distance” between two beliefs, although it’s not a distance in the usual geometric sense.
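
One reason the KL divergence is not a distance in the usual sense is that it is not symmetric. The following sketch shows this for two small discrete beliefs; the function name `kl_divergence` and the probabilities are assumptions made up for this illustration.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Hypothetical beliefs over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

kl_pq = kl_divergence(p, q)  # D_KL(p || q)
kl_qp = kl_divergence(q, p)  # D_KL(q || p), a different number in general
kl_pp = kl_divergence(p, p)  # zero for identical beliefs
print(kl_pq, kl_qp, kl_pp)
```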

In an idealized world, conditioning on shared evidence reduces the KL divergence between different priors. Indeed, this holds in the infinite-data limit (assuming that the prior beliefs share support). But our experiment reveals a more nuanced reality and raises the question of when it also holds with only a small amount of data.
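
A short derivation helps reconcile the intuition with what follows. This is a sketch using the standard chain rule for the KL divergence, under the assumption that both agents share the same likelihood, \(\pof{\x \given \w} = \qof{\x \given \w}\), so that their joint distributions differ only through the priors:

\[ \Kale{\pof{\w, \x}}{\qof{\w, \x}} = \Kale{\pof{\w}}{\qof{\w}} = \Kale{\pof{\x}}{\qof{\x}} + \E{\pof{\x}}{\Kale{\pof{\w \given \x}}{\qof{\w \given \x}}}. \]

Since \(\Kale{\pof{\x}}{\qof{\x}} \ge 0\), it follows that

\[ \E{\pof{\x}}{\Kale{\pof{\w \given \x}}{\qof{\w \given \x}}} \le \Kale{\pof{\w}}{\qof{\w}}. \]

So the posterior divergence cannot exceed the prior divergence on average over evidence drawn from \(\pof{\x}\). Nothing prevents it from being larger for one particular observation \(\x\), which is the case the experiment below looks at.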

The Half-Circle Experiment: Visualizing Divergence

Let’s look at a simple toy experiment that illustrates the potential mechanism. While deliberately simplified, it suggests how more complex scenarios might exhibit similar dynamics.

We want to learn two parameters, \(\theta_1\) and \(\theta_2\), collected in \(\w = (\theta_1, \theta_2)\), and we start with two different prior beliefs \(\pof{\w}\) and \(\qof{\w}\). We choose them such that \(\pof{\w}\) is concentrated on an upper half-circle and \(\qof{\w}\) on a lower half-circle, both centered at the origin. Each prior is a product of two Gaussian factors: one in the radius (centered at 1) and one in \(\theta_2\) (centered at \(+1\) for the upper and \(-1\) for the lower half-circle). This also ensures that the priors share support (that is, they are non-zero everywhere, just like Gaussian distributions).

import matplotlib.pyplot as plt
import numpy as np
from scipy.special import logsumexp

# Set matplotlib style
plt.style.use("ggplot")

# Note on area element:
# The area element is used to normalize the densities and calculate the KL divergence.
# It is the product of the step sizes in the x and y directions.
# The densities are defined on a grid, and the area element ensures that the densities 
# are properly normalized.
def normalize_log_density(log_density, area_element):
    """Normalize the log density in log space."""
    log_partition_constant = logsumexp(log_density) + np.log(area_element)
    return log_density - log_partition_constant

def gaussian_1D_log(x, mu, sigma):
    """Calculate the log density of a 1D Gaussian with mean mu and standard deviation sigma at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Resolution
num_steps = 400

# Meshgrid for visualization
x, x_step = np.linspace(-2, 2, num_steps, dtype=np.float64, retstep=True)
y, y_step = np.linspace(-2, 2, num_steps, dtype=np.float64, retstep=True)
X, Y = np.meshgrid(x, y)
R = np.sqrt(X**2 + Y**2)

area_element = x_step * y_step

# Calculate the log densities
log_density_upper_half = gaussian_1D_log(R, mu=1, sigma=0.1) + gaussian_1D_log(Y, mu=1, sigma=1)
log_density_lower_half = gaussian_1D_log(R, mu=1, sigma=0.1) + gaussian_1D_log(Y, mu=-1, sigma=1)

# Normalize the log densities in log space
log_density_upper_half = normalize_log_density(log_density_upper_half, area_element)
log_density_lower_half = normalize_log_density(log_density_lower_half, area_element)

# Convert log densities back to linear scale for visualization
density_upper_half = np.exp(log_density_upper_half)
density_lower_half = np.exp(log_density_lower_half)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 1+6/1.61))
cmap = "viridis"

# Upper half
contour_upper = axes[0].contourf(X, Y, density_upper_half, levels=50, cmap=cmap)
axes[0].set_title(r'Upper Half-Circle Prior $p(\theta)$')
lim = 1.5
axes[0].set_xlim(-lim, lim)
axes[0].set_ylim(-lim, lim)
axes[0].set_xlabel(r'$\theta_1$')
axes[0].set_ylabel(r'$\theta_2$')

# Lower half
contour_lower = axes[1].contourf(X, Y, density_lower_half, levels=50, cmap=cmap)
axes[1].set_title(r'Lower Half-Circle Prior $q(\theta)$')
axes[1].set_xlim(-lim, lim)
axes[1].set_ylim(-lim, lim)
axes[1].set_xlabel(r'$\theta_1$')
axes[1].set_ylabel(r'$\theta_2$')

fig.colorbar(contour_lower, ax=axes, label='Density')
plt.show()

Figure 1: Two different prior beliefs, visualized as half-circles.

As the likelihood function, we assume that 2D points are drawn from a Gaussian distribution centered at \(\w\) with a standard deviation of 0.1 in the \(\theta_1\) direction and 1 in the \(\theta_2\) direction. Now let’s introduce some “evidence”, namely a single observation at \((0,0)\):

# p((x', y') | \theta_1, \theta_2) = N(x'; \theta_1, 0.1) * N(y'; \theta_2, 1)
# Now for x=0, y=0, this is the same as
# p((0,0) | \theta_1, \theta_2) = N(0; \theta_1, 0.1) * N(0; \theta_2, 1) =
# = N(\theta_1; 0, 0.1) * N(\theta_2; 0, 1)
log_likelihood = gaussian_1D_log(-X, mu=0, sigma=0.1) + gaussian_1D_log(-Y, mu=0, sigma=1)

# Normalize the log likelihood in log space
# log_likelihood = normalize_log_density(log_likelihood, area_element)

# Convert log likelihood back to linear scale for visualization
likelihood = np.exp(log_likelihood)

# Plot the likelihood
plt.contourf(X, Y, likelihood, levels=50, cmap="viridis")
plt.title(r'Likelihood $p(x \mid \theta)$')
# plt.axis('equal')
lim = 2
plt.xlim(-lim, lim)
plt.ylim(-lim, lim) 
plt.xlabel(r'$\theta_1$')
plt.ylabel(r'$\theta_2$')
plt.colorbar(label='Density')
plt.show()

Figure 2: The likelihood function, representing the evidence.

When we update our priors with this evidence, we get the following posteriors:

log_posterior_upper = log_likelihood * 1.0 + log_density_upper_half
log_posterior_lower = log_likelihood * 1.0 + log_density_lower_half

# Normalize the log posteriors in log space
log_posterior_upper = normalize_log_density(log_posterior_upper, area_element)
log_posterior_lower = normalize_log_density(log_posterior_lower, area_element)

# Plot the posteriors
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

posterior_upper = np.exp(log_posterior_upper)
posterior_lower = np.exp(log_posterior_lower)

# Upper half
contour_upper = axes[0].contourf(X, Y, posterior_upper, levels=50, cmap="viridis")
axes[0].set_title(r'Upper Half-Circle Posterior $p(\theta \mid x)$')
lim = 1.5
axes[0].set_xlim(-lim, lim)
axes[0].set_ylim(-lim, lim)
axes[0].set_xlabel(r'$\theta_1$')
axes[0].set_ylabel(r'$\theta_2$')

# Lower half
contour_lower = axes[1].contourf(X, Y, posterior_lower, levels=50, cmap="viridis")
axes[1].set_title(r'Lower Half-Circle Posterior $q(\theta \mid x)$')
axes[1].set_xlim(-lim, lim)
axes[1].set_ylim(-lim, lim)
axes[1].set_xlabel(r'$\theta_1$')
axes[1].set_ylabel(r'$\theta_2$')
fig.colorbar(contour_lower, ax=axes, label='Density')
plt.show()

Figure 3: The posterior beliefs after updating with the evidence.

Now, here’s the surprising part. We measured the KL divergence between the priors and between the posteriors and found that the KL divergence between the posteriors is larger than the KL divergence between the priors. This means that more evidence actually increased the divergence between the beliefs!

def kl_divergence_numerical_log(p_log, q_log, area_element=area_element):
    """Numerically approximate the KL divergence between two distributions in log space."""
    # Calculate the KL divergence in log space
    return np.sum(np.exp(p_log) * (p_log - q_log)) * area_element

# Compute the KL divergence from the upper half-circle prior to the lower half-circle prior in log space
kl_div_upper_to_lower = kl_divergence_numerical_log(log_density_upper_half, log_density_lower_half)

# Compute the KL divergence from the lower half-circle prior to the upper half-circle prior in log space
kl_div_lower_to_upper = kl_divergence_numerical_log(log_density_lower_half, log_density_upper_half)

# Compute the KL divergence from the upper half-circle posterior to the lower half-circle posterior in log space
kl_div_upper_to_lower_posterior = kl_divergence_numerical_log(log_posterior_upper, log_posterior_lower)

# Compute the KL divergence from the lower half-circle posterior to the upper half-circle posterior in log space
kl_div_lower_to_upper_posterior = kl_divergence_numerical_log(log_posterior_lower, log_posterior_upper)

# Create a table with the KL divergences
import pandas as pd

kl_divergences = pd.DataFrame({
    'Prior': [kl_div_upper_to_lower, kl_div_lower_to_upper],
    'Posterior': [kl_div_upper_to_lower_posterior, kl_div_lower_to_upper_posterior]
}, index=[r'$D_\text{KL}(p \Vert q)$', r'$D_\text{KL}(q \Vert p)$'])

from IPython.display import Markdown
Markdown(kl_divergences.to_markdown())

Table 1: KL Divergence Before and After Evidence

| | Prior | Posterior |
|---|---|---|
| \(D_\text{KL}(p \Vert q)\) | 0.81179 | 1.48397 |
| \(D_\text{KL}(q \Vert p)\) | 0.81179 | 1.48397 |
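
Because these numbers come from a Riemann sum on a grid, it is worth checking that the same recipe reproduces a case with a known answer. The sketch below is a self-contained 1D sanity check, not part of the original experiment: for two Gaussians with equal standard deviation, the KL divergence is \((\mu_p - \mu_q)^2 / (2\sigma^2)\). The helper names and the grid are assumptions chosen for this check.

```python
import numpy as np
from scipy.special import logsumexp

def gaussian_log_density(x, mu, sigma):
    """Log density of a 1D Gaussian."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# 1D grid; the step size plays the role of the area element.
grid, step = np.linspace(-10, 10, 4001, retstep=True)

def normalize(log_density):
    """Normalize a log density on the grid so that it sums to one times 1/step."""
    return log_density - (logsumexp(log_density) + np.log(step))

def grid_kl(p_log, q_log):
    """Riemann-sum approximation of D_KL(p || q) from log densities."""
    return float(np.sum(np.exp(p_log) * (p_log - q_log)) * step)

# Hypothetical test case: unit-variance Gaussians centered at -1 and +1.
p_log = normalize(gaussian_log_density(grid, mu=-1.0, sigma=1.0))
q_log = normalize(gaussian_log_density(grid, mu=1.0, sigma=1.0))

kl_numeric = grid_kl(p_log, q_log)
kl_exact = (-1.0 - 1.0) ** 2 / (2 * 1.0 ** 2)  # closed form for equal variances
print(kl_numeric, kl_exact)
```

If the two values agree closely, the grid resolution and the area-element bookkeeping are unlikely to be the source of the effect in Table 1.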

We can also measure the divergence between the priors and between the posteriors using the Jensen-Shannon divergence. The Jensen-Shannon divergence is a symmetric measure of the divergence between two distributions.
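
For reference, the definition used in the code below can be written out as follows, where \(m(\w)\) is shorthand introduced here for the equal-weight mixture of the two beliefs:

\[ \JSD{\pof{\w}}{\qof{\w}} = \tfrac{1}{2} \Kale{\pof{\w}}{m(\w)} + \tfrac{1}{2} \Kale{\qof{\w}}{m(\w)}, \qquad m(\w) = \tfrac{1}{2} \left( \pof{\w} + \qof{\w} \right). \]

Unlike the KL divergence, it is bounded: with natural logarithms, as used in the code, it lies between \(0\) and \(\ln 2 \approx 0.693\).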

def js_divergence_numerical_log(p_log, q_log, area_element=area_element):
    """Numerically approximate the Jensen-Shannon divergence between two distributions in log space."""
    # Calculate the mixture distribution M = (P + Q)/2 in log space
    log_m = np.logaddexp(p_log, q_log) - np.log(2)
    
    # Calculate JSD = 0.5 * (KL(P||M) + KL(Q||M))
    kl_p_m = kl_divergence_numerical_log(p_log, log_m)
    kl_q_m = kl_divergence_numerical_log(q_log, log_m)
    
    return 0.5 * (kl_p_m + kl_q_m)

# Compute JS divergence between priors
js_div_prior = js_divergence_numerical_log(log_density_upper_half, log_density_lower_half)

# Compute JS divergence between posteriors  
js_div_posterior = js_divergence_numerical_log(log_posterior_upper, log_posterior_lower)

# Create a table comparing the divergences
divergences = pd.DataFrame({
    'Priors': [js_div_prior],
    'Posteriors': [js_div_posterior]
}, index=['Jensen-Shannon Divergence'])

from IPython.display import Markdown
Markdown(divergences.to_markdown())

Table 2: Jensen-Shannon Divergence Before and After Evidence

| | Priors | Posteriors |
|---|---|---|
| Jensen-Shannon Divergence | 0.18083 | 0.319379 |

Both the KL divergence and the Jensen-Shannon divergence between the posteriors are larger than the corresponding divergences between the priors. This means that after seeing the same evidence, the beliefs have become more different, not less!

This is where things get interesting and perhaps a bit counterintuitive (at least for me). It raises an important question: under what conditions might Bayesian agents exhibit belief divergence? We can hypothesize that the concentration of the initial priors and the nature of the evidence play a crucial role.

Understanding the Divergence: Priors and the Backfire Effect

It’s worth highlighting that these findings do not imply any flaw in Bayesian inference or an absence of rational thinking. Bayesian updating does exactly what it should: it reflects one’s starting assumptions and adjusts them using incoming data. Yet, when individuals (or agents) begin with starkly different and concentrated—perhaps thus “entrenched”—prior distributions, it can take considerable evidence to overcome those initial differences.

In the literature on polarization, this phenomenon has sometimes been connected to what was called the backfire effect (Nyhan and Reifler 2010): the idea that additional information, especially when it contradicts a well-established worldview, may reinforce rather than reduce prior bias. While recent research challenges how common such backfire effects are (Swire-Thompson, DeGutis, and Lazer 2020), our mathematical framework shows that temporary belief divergence can occur even with perfectly rational Bayesian updating.

Earlier psychological studies had described at least two versions of potential backfire effects:

  1. Worldview Challenge: When new evidence appears to threaten deeply cherished values or identities.
  2. Familiarity Effect: Where repeated exposure to ideas (even in corrections) might increase their cognitive accessibility.

However, more recent work suggests these effects may be less robust than initially thought, highlighting the importance of careful experimental design and measurement when studying belief updating (Swire-Thompson, DeGutis, and Lazer 2020).

This makes our mathematical demonstration of potential belief divergence particularly interesting, as it shows how apparent “backfire-like” effects between differing beliefs could emerge purely from rational Bayesian updating, without requiring special psychological mechanisms.

Echoes in the Real World: Examples and Implications

The polarization example we’ve examined already offers potential insights into real-world phenomena. It suggests a simple reason why presenting more facts often fails to resolve disagreements and can even exacerbate them. Several real-world examples show patterns that resemble the belief divergence we explored.

Before looking at these examples, it’s crucial to reiterate that our model is a highly simplified representation of reality. It captures a specific dynamic of Bayesian updating but omits a vast array of psychological, social, and cognitive factors that contribute to phenomena like political polarization. The examples below should be viewed as illustrations of how this simplified dynamic might manifest in more complex settings, not as complete explanations.

  • Political Polarization: Consider the charged debates surrounding the COVID-19 pandemic, particularly regarding measures like social distancing and mask mandates. Or, more recently, the polarized discussions around the wars in Ukraine and Gaza. While many factors drive political polarization—including echo chambers, confirmation bias, strategic misinformation, and media fragmentation—our observation suggests that some initial divergence might arise naturally even with perfectly rational actors. This offers a simpler baseline explanation for certain aspects of polarization, though reality likely involves the complex interplay of many psychological and social forces.

  • Conspiracy Theories: The persistence and occasional strengthening of conspiracy theories like QAnon, despite debunking efforts, involves multiple mechanisms. While social dynamics, motivated reasoning, and identity-protective cognition play major roles, the Bayesian perspective hints at how initial belief divergence could emerge even without invoking these additional factors. Of course, in practice, these psychological and social elements significantly amplify the effect.

  • Climate Change Denial: The ongoing debates around climate change provide another example. Despite overwhelming scientific consensus, climate change denial remains a significant issue in some circles. Our model suggests that even when presented with the same evidence, individuals with differing prior beliefs might initially diverge further, potentially contributing to the persistence of denialist views.

Research Corroboration

These observations are not isolated incidents but are consistent with a growing body of research across various fields:

  • Moral Foundations Theory: Jonathan Haidt’s seminal work, The Righteous Mind (Haidt 2012), illuminates how deeply ingrained moral intuitions can shape our interpretation of facts. This makes it challenging to change someone’s mind through limited evidence alone, especially when that evidence conflicts with their moral framework. Haidt’s research suggests that moral values act as powerful priors through which individuals process information, often leading to divergent interpretations of the same data.

  • Misinformation and Disinformation Studies: Studies on the spread of misinformation, such as those by Nyhan and Reifler (Nyhan and Reifler 2010), demonstrate that attempts to correct political misperceptions can sometimes backfire, leading to a strengthening of the initial false belief. Their research highlights the challenges of correcting misinformation, particularly when corrections challenge people’s preexisting views. However, as noted earlier, more recent work suggests such backfire effects may be less common than initially thought (Swire-Thompson, DeGutis, and Lazer 2020).

In all these cases, we observe a recurring pattern: shared information, rather than fostering consensus, can actually amplify divergence, at least initially.

In this post we have seen that divergence is not necessarily a sign of irrationality but can also arise as a natural consequence of differing prior beliefs and the process of Bayesian updating.

The Mathematics of Divergence

What we’ve demonstrated above is that the KL and Jensen-Shannon divergences between posteriors can exceed the divergences between priors despite both agents having observed identical evidence. This mathematical finding aligns with research showing that belief polarization can be entirely rational within a probabilistic framework (Jern, Chang, and Kemp 2014).

The key insight is that Bayesian updating operates relative to prior beliefs:

  1. Each agent updates their beliefs consistently with Bayes’ rule
  2. The same evidence can strengthen different regions of belief for different priors
  3. The observed divergence occurs even when both agents process information rationally

In particular, an apparent worldview backfire effect could arise even between perfectly rational agents, as we have seen in the example above. This showcases how updates that are perfectly rational in a Bayesian sense could still look puzzling from the outside.
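
The mechanism in point 2 can be reproduced with just three hypotheses. The sketch below is a discrete toy that only loosely mirrors the half-circle setup: the labels "top", "side", and "bottom" and all numbers are made up for illustration. Both agents agree on the "side" hypothesis, and the shared evidence mainly rules that hypothesis out.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats for strictly positive discrete distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def bayes_update(prior, likelihood):
    """Posterior over the hypotheses: prior times likelihood, renormalized."""
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

# Hypothetical priors over the hypotheses ("top", "side", "bottom").
p_prior = np.array([0.6, 0.3, 0.1])  # agent p leans "top"
q_prior = np.array([0.1, 0.3, 0.6])  # agent q leans "bottom"

# Shared evidence: equally compatible with "top" and "bottom", nearly rules out "side".
likelihood = np.array([1.0, 0.01, 1.0])

p_post = bayes_update(p_prior, likelihood)
q_post = bayes_update(q_prior, likelihood)

kl_prior = kl_divergence(p_prior, q_prior)
kl_post = kl_divergence(p_post, q_post)
print(f"prior KL = {kl_prior:.3f}, posterior KL = {kl_post:.3f}")
```

The posterior KL divergence comes out larger than the prior one: the evidence removes the region where the two agents overlapped, and each agent redistributes that mass according to their own prior.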

While the aforementioned “Belief Polarization Is Not Always Irrational” (Jern, Chang, and Kemp 2014) showed that observing the same data can drive two participants’ beliefs apart in a binary hypothesis test, our example here goes a step further by shifting from discrete/binary (e.g., “Is the parameter high or low?”) to continuous parameter spaces, illustrating polarization in the shape of full distributions rather than simple yes/no choices.

Convergence in the Long Run

This “backfire” or short-term belief divergence still gives way to convergence in the long run for rational actors. The same Bayesian machinery that initially magnifies differences will eventually, with enough pertinent data, push posteriors toward the ground truth when the model is well-specified. In formal terms:

\[ \lim_{N \to \infty} \pof{\theta \mid x_1, \ldots, x_N} = \delta(\theta - \theta^*) \]

where \(\theta^*\) is the true (data-generating) parameter. More generally, if the two belief distributions share support (that is, they are non-zero in the same regions), the posteriors in the infinite data limit will also agree.
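
A conjugate example makes the long-run behavior easy to check by hand. The sketch below is a separate toy, not the half-circle model: two agents hold opposing Beta priors on the bias of a coin and see the same idealized data from a fair coin. The agent names, the prior parameters, and the use of the gap between posterior means as a crude proxy for disagreement are all assumptions made for this illustration.

```python
# Beta(a, b) prior + `heads` heads in `flips` flips -> Beta(a + heads, b + flips - heads).
priors = {"optimist": (8.0, 2.0), "pessimist": (2.0, 8.0)}  # hypothetical priors

def posterior_mean(a, b, heads, flips):
    """Mean of the Beta posterior after observing `heads` out of `flips`."""
    return (a + heads) / (a + b + flips)

gaps = []
for flips in [0, 10, 100, 1000, 10000]:
    heads = flips // 2  # idealized data: exactly half heads (true bias 0.5)
    means = [posterior_mean(a, b, heads, flips) for a, b in priors.values()]
    gaps.append(abs(means[0] - means[1]))
    print(f"flips={flips:>5}  gap between posterior means = {gaps[-1]:.4f}")
```

The gap shrinks steadily as the shared data accumulates, which is the convergence the limit above describes.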

We can visualize this using our example by varying the strength of the evidence at \((0, 0)\):
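
One way to read the evidence strength \(\lambda\) used in the code below (an interpretation added here, writing \(x_1, x_2\) for the coordinates of the observation and \(\sigma_1 = 0.1\), \(\sigma_2 = 1\) for the standard deviations of the likelihood): the code multiplies the log-likelihood by \(\lambda\), which raises the likelihood to the power \(\lambda\),

\[ \pof{\x \given \w}^\lambda \propto \exp\left( -\lambda \frac{(x_1 - \theta_1)^2}{2 \sigma_1^2} - \lambda \frac{(x_2 - \theta_2)^2}{2 \sigma_2^2} \right). \]

For integer \(\lambda\), this is the likelihood of observing the same point \(\lambda\) times. For general \(\lambda\), it acts like a Gaussian likelihood whose standard deviations are shrunk by a factor of \(\sqrt{\lambda}\).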

import matplotlib.animation as animation

def create_posterior_animation(log_prior_upper, log_prior_lower, log_likelihood, multipliers):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    fig.suptitle('Evolution of Posteriors with Increasing Evidence Strength')
    
    # Initialize plots
    im1 = ax1.imshow(np.exp(log_prior_upper), extent=[-2, 2, -2, 2], cmap="viridis")
    im2 = ax2.imshow(np.exp(log_prior_lower), extent=[-2, 2, -2, 2], cmap="viridis")
    
    ax1.set_title(r'Upper Half-Circle Posterior $p(\theta \mid x^\lambda)$')
    ax2.set_title(r'Lower Half-Circle Posterior $q(\theta \mid x^\lambda)$')
    
    # Add colorbars
    fig.colorbar(im1, ax=(ax1, ax2))

    js_div_log = {}
    
    def update(multiplier):
        # Compute posteriors with the current multiplier, using the priors passed to this function
        log_posterior_upper = log_likelihood * multiplier + log_prior_upper
        log_posterior_lower = log_likelihood * multiplier + log_prior_lower

        # Normalize the log posteriors in log space
        log_posterior_upper = normalize_log_density(log_posterior_upper, area_element)
        log_posterior_lower = normalize_log_density(log_posterior_lower, area_element)

        # Update plots
        im1.set_array(np.exp(log_posterior_upper))
        im2.set_array(np.exp(log_posterior_lower))

        # TODO: update colorbar somehow?
        
        # Update title with current multiplier and JS divergence
        js_div = js_divergence_numerical_log(log_posterior_upper, log_posterior_lower)
        js_div_log[multiplier] = js_div
        fig.suptitle(f'Evolution of Posteriors (Evidence Strength $\\lambda$: {multiplier:.1f}, JS Div: {js_div:.3f})')
        
        return [im1, im2]
    
    anim = animation.FuncAnimation(
        fig, update,                 
        frames=multipliers, 
        interval=100, 
        blit=True,
        repeat=True)
    plt.close(fig)
    return anim, js_div_log

# Create the animation with evidence strengths log-spaced from 0.001 to 1000
multipliers = np.geomspace(0.001, 1000, 60)
anim, js_div_log = create_posterior_animation(log_density_upper_half, log_density_lower_half, log_likelihood, multipliers)

# Render the animation as an HTML5 video for display in the notebook
from IPython.display import HTML
HTML(anim.to_html5_video())

# Plot the JS divergence recorded for each evidence strength while rendering
plt.plot(js_div_log.keys(), js_div_log.values())
plt.xlabel("Evidence Strength")
plt.ylabel("Jensen-Shannon Divergence")
plt.xscale("log")
plt.show()

Figure 5: Jensen-Shannon Divergence as a Function of Evidence Strength.

In Figure 5, we see how the divergence between the distributions first increases and then goes to 0 as the amount of evidence increases.

The key question thus becomes: how much evidence is needed to overcome initial differences in priors?

This depends on:

  • The concentration of initial priors
  • The quality/clarity of the evidence
  • How directly the evidence addresses key uncertainties

The Role of Ground Truth

An important nuance emerges when we consider what happens if one of the agents starts with beliefs concentrated at the true parameters. In this case, we observe monotonic convergence of the other agent’s beliefs toward the truth, without the temporary divergence we saw earlier.

# Create a fixed "ground truth" posterior by applying strong evidence to one prior
log_fixed_posterior = log_density_upper_half + 1000 * log_likelihood

log_fixed_posterior = normalize_log_density(log_fixed_posterior, area_element)

# Function to compute KL divergence between fixed posterior and evolving posterior
def compute_kl_div_with_fixed(log_evolving_prior, log_fixed_posterior, log_likelihood, multiplier):
    # Compute log posteriors
    log_evolving_posterior = log_evolving_prior + multiplier * log_likelihood

    # Normalize the log posteriors in log space
    log_evolving_posterior = normalize_log_density(log_evolving_posterior, area_element)

    # Compute KL divergence using numerical log method
    return kl_divergence_numerical_log(log_evolving_posterior, log_fixed_posterior)

# Compute divergence for range of evidence strengths
kl_div_fixed = {}
for multiplier in multipliers:
    kl_div = compute_kl_div_with_fixed(
        log_density_lower_half, 
        log_fixed_posterior, 
        log_likelihood, 
        multiplier
    )
    kl_div_fixed[multiplier] = kl_div

# Plot the divergence curve
plt.figure(figsize=(10, 6))
plt.plot(kl_div_fixed.keys(), kl_div_fixed.values())
plt.xlabel("Evidence Strength")
plt.ylabel("KL Divergence")
plt.xscale("log")
# plt.yscale("log")
plt.title("Convergence to Ground Truth")
plt.show()

Figure 6: KL Divergence as a Function of Evidence Strength vs. Fixed Ground Truth.

This observation highlights that the divergence effect we demonstrated requires both agents to start with beliefs that are “wrong” (even if rationally held given their prior information). When an agent has accurate beliefs, their position acts as an attractor for other rational agents observing the same evidence. This aligns with our intuition about scientific consensus—when strong evidence supports a clear truth, rational agents tend to converge toward it.

This doesn’t undermine our earlier observations about potential divergence—rather, it enriches our understanding in two ways:

  1. Practical Implications: In real-world scenarios, we rarely start with perfect knowledge. Most debates involve multiple parties working with incomplete information and imperfect models.

  2. Path Dependence: The journey to truth may involve temporary increases in disagreement, even when ultimate convergence is guaranteed. This “divergence before convergence” pattern might help explain why scientific and social debates anecdotally often become more heated before reaching resolution.

This insight about ground truth provides hope: while beliefs may temporarily diverge, sufficient high-quality evidence will eventually lead rational agents to converge on accurate beliefs. The challenge lies in gathering and effectively communicating that evidence.

Open Questions and Research Directions

This note raises several important questions for future research:

Theoretical Foundations

  1. What are the necessary and sufficient conditions for belief divergence to occur in the finite data case?
  2. Can we characterize the space of divergence-inducing evidence for specific classes of priors and likelihoods?
  3. How do different divergence measures, beyond KL and JS divergence, behave in this context? For instance, one could explore the Hellinger distance or total variation distance.
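
As a starting point for the third question, the two measures mentioned there can be computed on a grid in the same style as the code above. This is a minimal sketch under assumed names (`total_variation`, `hellinger`) and a hypothetical 1D test case; it is not something the experiments in this note already use.

```python
import numpy as np

def total_variation(p, q, area_element):
    """Total variation distance: half the integrated absolute difference."""
    return float(0.5 * np.sum(np.abs(p - q)) * area_element)

def hellinger(p, q, area_element):
    """Hellinger distance: sqrt(1 - integral of sqrt(p * q))."""
    overlap = np.sum(np.sqrt(p * q)) * area_element
    return float(np.sqrt(max(0.0, 1.0 - overlap)))  # clip tiny negative rounding errors

def gaussian(x, mu, sigma=1.0):
    """Density of a 1D Gaussian."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical test case: unit-variance Gaussians centered at -1 and +1 on a 1D grid.
grid, step = np.linspace(-10, 10, 4001, retstep=True)
p, q = gaussian(grid, -1.0), gaussian(grid, 1.0)

tv = total_variation(p, q, step)
hd = hellinger(p, q, step)
print(f"TV = {tv:.4f}, Hellinger = {hd:.4f}")
```

Both measures are symmetric and bounded between 0 and 1, so they can be tracked across evidence strengths in the same way as the Jensen-Shannon divergence.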

Empirical Investigation

  1. How prevalent is this phenomenon in more realistic scenarios, beyond our simplified example?
  2. What role does model misspecification play in exacerbating or mitigating belief divergence?
  3. Can we design experiments to measure these effects in human belief updating, and how do they relate to existing psychological theories like the backfire effect?

But the most important question is how these findings translate to more complex, real-world settings. The insight provided by our simplified model is a starting point, but it would be essential to explore how these dynamics interact with other factors that shape human beliefs, such as social influence, motivated reasoning, and cognitive biases. Bridging the gap between such a theoretical model and the intricacies of human cognition has always been the key challenge.

While there are no easy solutions to the problem of polarization and belief divergence, understanding the dynamics revealed by our Bayesian framework offers valuable insights.

Key Takeaways:

  1. The Power of Priors: Recognizing the significant influence of prior beliefs can foster more empathetic and understanding approaches to communication and persuasion. We must acknowledge that individuals interpret new information through the lens of their existing beliefs. Acknowledging the power of priors is crucial for designing effective communication strategies that resonate with diverse audiences.

  2. Divergence as a Natural Step: We should anticipate that presenting evidence may initially increase divergence before leading to convergence. This is particularly true when dealing with deeply held beliefs or when the evidence is complex or ambiguous. Recognizing that divergence can be a natural part of the belief-updating process can help us maintain patience and persistence in the face of initial resistance.

  3. The Need for Nuanced Communication: This highlights the importance of more nuanced approaches to communication and persuasion. Simply presenting facts is often insufficient. Instead, we must consider the role of priors, the potential for backfire effects, and the need to tailor our communication to the specific audience and context. Effective communication requires a deep understanding of the audience’s existing beliefs, values, and cognitive biases.

A Hopeful Note

By understanding these dynamics, we can strive to engage in more constructive dialogue, build bridges across divides, and navigate the complexities of a world where beliefs often diverge, even in the face of shared evidence. While our model offers a simplified view, it highlights a fundamental aspect of rational belief updating that may contribute to real-world polarization.

The Bayesian perspective reminds us that convergence is possible, but it requires patience, high-quality evidence, and a willingness to engage with differing viewpoints in a thoughtful and respectful manner. The aspirational message is thus not to give up prematurely in a debate but to keep providing facts, recognizing that the path to consensus may be longer and more complex than we initially assume.

One shouldn’t give up engaging in grounded debates. The other side might be perfectly rational, yet from one’s own vantage point one might still observe stronger polarization at first. While the path to consensus may be long and winding, these insights can guide us toward more effective strategies for navigating disagreements and fostering understanding in an increasingly polarized world. Further research, incorporating the complexities of human cognition and social dynamics, will be crucial to fully realizing this potential.

References

Haidt, J. 2012. The Righteous Mind: Why Good People Are Divided by Politics and Religion. Allen Lane.

Jern, Alan, Kai-Min K Chang, and Charles Kemp. 2014. “Belief Polarization Is Not Always Irrational.” Psychological Review 121 (2): 206.

Nyhan, Brendan, and Jason Reifler. 2010. “When Corrections Fail: The Persistence of Political Misperceptions.” Political Behavior 32 (2): 303–30.

Swire-Thompson, Briony, Joseph DeGutis, and David Lazer. 2020. “Searching for the Backfire Effect: Measurement and Design Considerations.” Journal of Applied Research in Memory and Cognition 9 (3): 286–99.