Simulation-Based Calibration: The Missing Test in Your Bayesian Workflow

In a recent post on generative modeling I mentioned a stress test that I find genuinely sobering: across sixteen synthetic scenarios with known ground truth, eight produced material attribution errors while the computational diagnostics looked perfectly healthy. R-hat under 1.01. Tail ESS well above 400. Zero divergences. And the channel ROIs were wrong.

That result isn’t a quirk of my framework. It’s a structural fact about the relationship between MCMC diagnostics and inference quality. The diagnostics I described in Read the Diagnostics First — R-hat, ESS, divergences — tell you something specific and important: they guarantee that your samples are a valid draw from the posterior you specified. They say nothing about whether that posterior correctly recovers parameters from data the model itself generated.

The test that checks the latter is Simulation-Based Calibration (SBC). It’s been in the literature since Cook, Gelman & Rubin (2006) and was formalized by Talts et al. (2018). It belongs in the standard Bayesian workflow. Most MMM practitioners have never run it.

The idea: a valid posterior passes a rank test

The core insight is this. If your posterior is correct — meaning it properly represents uncertainty about parameters given data generated by the model — then the following should be true: take a parameter value $\theta^\*$ drawn from the prior, generate a dataset $y$ from the likelihood at that $\theta^\*$ , and fit the posterior $p(\theta \mid y)$ . The rank of $\theta^\*$ within the posterior samples should be uniformly distributed over $\{0, 1, \ldots, S\}$ where $S$ is the number of posterior draws.

Why? Because a calibrated posterior is, on average, centered at the truth. Sometimes $\theta^\*$ will fall in the lower tail (low rank), sometimes in the upper tail (high rank), and most often near the middle — but across many simulated datasets, no region of the rank distribution should be preferred over any other. Uniformity is the fingerprint of a correctly calibrated inference procedure.

Formally: if $\theta^\* \sim p(\theta)$ and $y \sim p(y \mid \theta^\*)$ , then the rank statistic

$r = \{\theta_s \leq \theta^\* : \theta_s \sim p(\theta \mid y),\ s = 1, \ldots, S\}$

should satisfy $r \sim \text{Uniform}(0, S)$ when marginalizing over both $\theta^\*$ and $y$ .

That’s the whole test. You simulate many pairs $(\theta^\*, y)$ , fit the model each time, compute the rank, and check whether the resulting histogram of ranks is flat.

What non-uniform ranks tell you

The shape of the deviation from uniformity is diagnostic. Each failure mode has a different signature, and each one points at a different problem.

Trough shape (∪) — the posterior is over-dispersed. The true value lands in the middle too often, which means the posterior is tighter than it should be. Your credible intervals are too narrow; you’re falsely confident. In MMMs this often surfaces in adstock parameters: the model is confidently reporting a 0.7 retention rate with a tight interval when the true generating rate was anywhere in $[0.4, 0.9]$ . The sampler is confident; it’s just confidently wrong about how uncertain it should be.

Arch shape (∩) — the posterior is over-concentrated. The true value lands in the tails too often. Intervals are too wide. This can happen when you’ve over-regularized with a very diffuse prior that’s pulling the posterior away from where the data should be pointing it.

Shifted shape — the posterior is biased. Ranks pile up on the left (systematic underestimation) or on the right (systematic overestimation). A saturated saturation curve where the half-saturation parameter is near-collinear with baseline often produces this pattern: the sampler consistently underestimates one and overestimates the other.

An important note: these failures are invisible to R-hat, ESS, and divergences. Those diagnostics can look perfect while the rank histogram is definitively non-uniform. The test is measuring a different property — not “did the chains mix?” but “does the inference recover the truth in expectation?”

Running SBC in practice

The basic loop is straightforward to write. In PyMC, it looks like this:

import pymc as pm
import numpy as np
import arviz as az
from scipy.stats import uniform

def run_sbc(model_fn, n_simulations=500, n_draws=500, n_tune=1000):
    """
    model_fn: callable that returns a (pm.Model, observed_var_name) tuple
              given observed data.
    Expects model_fn to expose a pm.Data container for the observations.
    Returns an array of ranks, shape (n_simulations, n_params).
    """
    all_ranks = []

    for _ in range(n_simulations):
        # Step 1: draw from prior
        with model_fn(observed=None) as model:
            prior = pm.sample_prior_predictive(samples=1)

        theta_star = {k: prior.prior[k].values[0, 0] for k in prior.prior}
        y_sim = prior.prior_predictive["obs"].values[0, 0]

        # Step 2: fit posterior on simulated data
        with model_fn(observed=y_sim) as model:
            trace = pm.sample(draws=n_draws, tune=n_tune, progressbar=False)

        # Step 3: compute ranks
        for param in theta_star:
            posterior_draws = trace.posterior[param].values.flatten()
            rank = np.sum(posterior_draws <= theta_star[param])
            all_ranks.append({"param": param, "rank": rank, "n_draws": len(posterior_draws)})

    return all_ranks

ArviZ then gives you the visualization directly:

# After collecting ranks across simulations
az.plot_bfmi(trace)  # for energy diagnostics
# For SBC specifically, plot rank histograms:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, len(params), figsize=(4 * len(params), 3))
for ax, param in zip(axes, params):
    ranks = [r["rank"] for r in all_ranks if r["param"] == param]
    ax.hist(ranks, bins=20, edgecolor="black")
    ax.axhline(len(ranks) / 20, color="red", linestyle="--", label="uniform")
    ax.set_title(param)
plt.tight_layout()

In practice, 200–500 simulations is usually enough to detect severe miscalibration; 1000+ is useful when you want confidence about mild distortions.

What I find in MMMs

The parameters most likely to fail SBC in a marketing-mix model are the ones you’d guess from first principles: adstock retention rates and saturation half-saturation constants. Both are weakly identified when media spend has low variation or the time series is short — which is to say, almost always.

What SBC reveals is which failure mode you’re in. A retention rate that fails with an arch distribution is a parameter your model is too confident about; you need to widen the prior or accept that the posterior width should be wider. A half-saturation parameter that fails with a bias (shifted left) is one where the likelihood surface has a systematic asymmetry — often because the observed spend range sits entirely in the linear region, so the model never sees the curve turn over and is extrapolating.

Both of these failures look fine in R-hat and ESS. The sampler mixed just fine. It’s the inference that’s wrong. That’s exactly the distinction that SBC is built to surface, and that the MCMC diagnostics aren’t.

Framing SBC in the workflow

The place SBC belongs is between “write and prior-check the model” and “fit on real data.” It’s the pre-flight that tests whether the engine actually works, before you put passengers aboard.

The sequence I now run for any new model structure:

Write the generative story; check with prior predictive simulation.
Run SBC: verify that rank histograms are approximately uniform for all parameters of interest.
If SBC fails: fix the model (reparameterize, adjust priors, or accept wider posteriors) and re-run SBC.
Fit on real data.
Check MCMC diagnostics (R-hat, ESS, divergences).
Run posterior predictive checks.
Calibrate against experimental lift tests.

Step 2 is the one most practitioners skip because it’s the most expensive upfront. But the cost is actually concentrated in the right place: you pay it once per model architecture, not per dataset. Once you’ve established that your adstock + saturation structure recovers parameters correctly under simulation, you can fit it on a dozen clients’ data without repeating the SBC. The architecture is certified; the question becomes only the fit.

The sixteen-scenario stress test I referenced in the generative modeling post is essentially a coarser version of this: simulating data from specific known configurations and checking whether the fitted model recovers them. SBC is that test made rigorous and automated — converting a qualitative “does it roughly work?” into a quantitative “are the rank distributions consistent with uniform?”

The honest answer is that most MMM practitioners, myself included, don’t do this as rigorously as they should. It takes computation; it requires thinking carefully about what “recovering the parameter” even means for a weakly-identified curve; and the failure modes, when you find them, demand investigation rather than a quick fix. But the alternative — deploying posteriors whose calibration you haven’t checked, and then calibrating against experiments you planned based on those posteriors — is the kind of compounded uncertainty that makes measurement programs quietly drift from the truth over time.

Run the ranks. See if they’re flat. If they aren’t, you have something specific to fix.

SBC is introduced formally in Talts, Betancourt, Simpson, Vehtari & Gelman (2018), “Validating Bayesian Inference Algorithms with Simulation-Based Calibration,” arXiv:1804.06788. The earlier formulation is Cook, Gelman & Rubin (2006), “Validation of Software for Bayesian Models Using Posterior Quantiles,” Journal of Computational and Graphical Statistics 15(3). The prior/posterior predictive check workflow is from Gelman et al. (2020), “Bayesian Workflow,” arXiv:2011.01808. Related posts: All Models Are Wrong, So Make Yours Generative, Read the Diagnostics First.