Closed-Loop Measurement & Calibration

A comprehensive guide to integrating measurement into Marketing Mix Modeling. The Bayesian MMM and geo-lift experiments are not competing paradigms - they are complementary nodes in one inference graph. This page lays out how to wire them into a single self-correcting cycle that compounds learning over time.

Why this page exists

Most teams already run an MMM and a geo-lift program - and treat them as separate budgets, separate timelines, and separate answers. The closed-loop framework wires them together: the MMM chooses which experiments to run, and the experiments calibrate the next MMM. Each cycle tightens the parts of the picture that matter most for budget decisions.

The entire loop described on this page is operated from the platform: the Program page shows where you are in the measurement cycle, and the Experiments page hosts the priority matrix and the experiment design studio. The agent workspace can also run the planning tools conversationally—ask it to compute_experiment_priorities or design_experiment_plan and it works through the same machinery in chat.

No usable history? The model-free loop

Everything on this page assumes a fitted MMM anchoring the cycle. When there is no model to anchor it — a new brand, a new market, or history too confounded to fit — the continuous-learning loop runs the same design → readout → refit rhythm model-free: it learns a response surface directly from designed geo experiments (and can start from lift tests you have already run), then hands its readouts back to a full MMM once usable history accumulates.

The Two-Paradigm Problem

The MMM is the system of record for "what is each channel worth." It produces a clean ROI per channel, and the planning team allocates budget against those numbers. Separately, the team runs geo-lift experiments - paid search holdouts, CTV market tests, audio matched-market trials - each producing its own causal estimate. Sometimes the two agree; sometimes they don't. When they disagree, someone picks the source they trust more for that channel and moves on.

MMM System

"CTV ROI 1.4x"

Tight credible interval. Joint over all channels. Identified by observational variation - which means entangled when channels move together.

⚡

Experiment System

"CTV ROI 2.1x"

Wider interval. Single-channel scope. Causally identified by the randomization itself - so the estimate isn't entangled with the rest of the marketing plan.

The structural problem is not that either system is bad - both are well-built. The problem is that neither is wired into the other. The MMM doesn't know which channels need experimental backup; the experiments don't feed back into the next MMM fit. So the team keeps paying for both and keeps accepting whichever number the analyst trusted that week.

A tight number is not a true number

An MMM produces tight intervals when the model is well-regularized and the historical data tells a consistent story. But "consistent" is not the same as "correct" - when channels move in lockstep (national TV and digital flighted together; promo periods overlapping with seasonality), the model cannot tell them apart, and a regularizing prior produces a tight number anyway, anchored to an arbitrary point along an unidentified ridge. Confidently wrong is a real failure mode.

Two Complementary Objectives

Once you accept that experiments are the calibration tool, you still face the question of which experiments. Two distinct objectives emerge - they are correlated but not identical, and a good framework integrates both. The split mirrors the two families of design utilities - information-based and decision-based - unified in Chaloner & Verdinelli's (1995) review of Bayesian experimental design.

Epistemic - reduce uncertainty

Quantify posterior entropy per channel
Pick experiments that maximally collapse the posterior
Prioritize where beliefs are least grounded
Metric: Expected Information Gain (EIG)

Instrumental - improve decisions

Map ROI uncertainty to budget decision quality
Prioritize where errors are most costly
Quantify the dollar value of resolving each uncertainty
Metric: Expected Value of Information (EVOI)

A boutique channel may have huge posterior variance (high epistemic priority) but receive 1% of spend (low decision stakes). A workhorse channel may have moderate variance but receive 30% of spend, making any uncertainty extraordinarily costly. EIG and EVOI together produce a priority map that respects both.

Bayesian Foundations: Prior, Likelihood, Posterior

Bayesian inference is the discipline of updating beliefs in the face of new evidence. Three objects appear repeatedly in this guide:

Prior $p(\theta)$ - what you believe about an unknown parameter $\theta$ before seeing experimental data. For us, this is what the MMM tells us about a channel's ROI.
Likelihood $p(\hat{y} \mid \theta)$ - the probability of observing the experimental outcome $\hat{y}$ if the truth were $\theta$.
Posterior $p(\theta \mid \hat{y})$ - the updated belief about $\theta$ after seeing $\hat{y}$.

$$p(\theta \mid \hat{y}) \;=\; \frac{p(\hat{y} \mid \theta)\, p(\theta)}{p(\hat{y})} \;\propto\; \underbrace{p(\hat{y} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}$$ (1)

For the rest of this guide, $\theta_k$ denotes a channel-level causal parameter (typically ROI or elasticity) for channel $k$, and $\hat{y}_k$ denotes the noisy estimate produced by an experiment. The MMM provides $p(\theta_k)$. The experiment provides $p(\hat{y}_k \mid \theta_k)$. The product is what we want.

The Gaussian Conjugate Update

When the prior is Gaussian and the experiment likelihood is Gaussian (the standard assumption for geo-lift difference-in-differences), the math simplifies dramatically. With prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and likelihood $\hat{y} \mid \theta \sim \mathcal{N}(\theta, \sigma_e^2)$, the posterior is also Gaussian:

$$\theta \mid \hat{y} \;\sim\; \mathcal{N}\!\left(\mu_{\text{post}},\; \sigma_{\text{post}}^2\right)$$ $$\sigma_{\text{post}}^{-2} \;=\; \sigma_0^{-2} + \sigma_e^{-2}, \qquad \mu_{\text{post}} \;=\; \sigma_{\text{post}}^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{y}}{\sigma_e^2}\right)$$ (2)

Plain English

Precision (the inverse of variance) is additive. The posterior precision is the prior precision plus the experiment's precision. The posterior mean is a precision-weighted average of the prior mean and the experimental estimate - whichever is more precise pulls the posterior more strongly toward itself.

Equation (2) is the workhorse for almost everything that follows. EIG, calibration, and the bridge from frequentist estimates all reduce to applications of this single update.
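A minimal sketch of the update in Eq. (2), assuming scalar Gaussian inputs. The function name and the example standard deviations are illustrative assumptions; only the 1.4x and 2.1x point estimates echo the CTV example above.

```python
from typing import Tuple

def gaussian_update(mu_0: float, sigma_0: float, y_hat: float, sigma_e: float) -> Tuple[float, float]:
    """Eq. (2): posterior mean and sd for a Gaussian prior and Gaussian likelihood."""
    prec_prior = 1.0 / sigma_0 ** 2   # precision of the MMM prior
    prec_exp = 1.0 / sigma_e ** 2     # precision of the experiment
    var_post = 1.0 / (prec_prior + prec_exp)          # precisions add
    mu_post = var_post * (mu_0 * prec_prior + y_hat * prec_exp)  # precision-weighted mean
    return mu_post, var_post ** 0.5

# Illustrative inputs (assumed): MMM says 1.4 with sd 0.2, experiment says 2.1 with sd 0.4
mu_post, sd_post = gaussian_update(mu_0=1.4, sigma_0=0.2, y_hat=2.1, sigma_e=0.4)
print(round(mu_post, 3), round(sd_post, 3))  # 1.54 0.179
```

The tighter MMM prior carries four times the precision of the experiment here, so the posterior mean moves only a fifth of the way from 1.4 toward 2.1.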

Decision Structure

Let $\theta_k$ denote the true causal ROI for channel $k \in \{1, \ldots, K\}$. The MMM produces a joint posterior $p(\boldsymbol{\theta}) = p(\theta_1, \ldots, \theta_K \mid y, X, \mathcal{M})$. The planner selects an experimental portfolio $A \subseteq \{1, \ldots, K\}$ together with a design $d_k$ for each chosen channel:

$$A \;=\; \{(k_1, d_1),\, (k_2, d_2),\, \dots\}, \qquad d \;=\; \{n_{\text{test}},\, \Delta_{\text{spend}},\, \text{duration},\, \ldots\}$$ $$\text{subject to}\quad |A| \le C_{\text{ops}},\quad \mathrm{cost}(A) \le B$$ (3)

Each experiment produces a noisy estimate of the true causal effect. For a geo-lift difference-in-differences design, this is approximately Gaussian:

$$\hat{\tau}_k \mid \tau_k, d \;\sim\; \mathcal{N}\!\left(\tau_k,\; \sigma_{\text{exp},k}^2(d)\right), \qquad \sigma_{\text{exp},k}^2(d) \;\approx\; \frac{2\, s^2_{\text{geo},k}}{n\, T} \cdot \frac{1}{\Delta^2_{\text{spend},k}}$$ (4)

where $s^2_{\text{geo},k}$ is the residual geo-week variance from the pre-period, $n$ is geos per arm, and $T$ is the test duration in weeks. More geos, longer tests, and bigger spend deltas all shrink experimental noise.

Deep dive: Where the experiment-noise formula in Eq. (4) comes from, and its limits

Where Eq. (4) comes from, and its limits: it is the variance of a two-arm DiD mean contrast ($2 s^2_{\text{geo},k}/nT$ for independent geo-weeks), divided by $\Delta^2_{\text{spend},k}$ to convert a KPI-scale lift into a per-dollar effect. It assumes independent geo-week residuals — under serial correlation, multiply by the AR(1) design effect $D(T,\rho_g)$ defined in the design section below. The shipped designer (planning/design.py) handles the same problem empirically rather than analytically: it calibrates its analytic DiD power curve against a sliding-window placebo distribution from the pre-period, which absorbs the realized autocorrelation. Treat Eq. (4) as a planning approximation; the placebo calibration and the pre-experiment simulation in Step 5 are the checks that catch what it misses.

Expected Information Gain (EIG)

For a given channel $k$ and design $d$, the Expected Information Gain is the expected KL divergence from the prior to the updated posterior, integrated over what experimental outcome we might see - Lindley's (1956) classical measure of the information provided by an experiment. Equivalently, it is the mutual information between the unknown parameter and the experimental outcome:

$$\mathrm{EIG}(k, d) \;=\; \mathbb{E}_{\hat{y}_k}\!\left[\, \mathrm{KL}\!\left(\, p(\theta_k \mid \hat{y}_k, d) \,\middle\|\, p(\theta_k) \,\right) \right] \;=\; H[p(\theta_k)] \;-\; \mathbb{E}_{\hat{y}_k}\!\big[H[p(\theta_k \mid \hat{y}_k)]\big]$$ (5)

Under the Gaussian-Gaussian setup, EIG has a remarkably clean closed form. Because the posterior variance does not depend on the realized value of $\hat{y}$, the expected entropy reduction simplifies to:

$$\mathrm{EIG}(k, d) \;=\; \tfrac{1}{2}\log\!\left( \frac{\sigma_k^2}{\sigma_{\text{post},k}^2} \right) \;=\; \tfrac{1}{2}\log\!\left( 1 + \frac{\sigma_k^2}{\sigma_{\text{exp},k}^2(d)} \right)$$ (6)

Read this formula carefully

Equation (6) is the single most useful formula in this framework. Information gain depends only on a signal-to-noise ratio: the channel's prior variance divided by the experiment's noise variance. A noisy experiment on an uncertain channel can yield as much information as a precise experiment on a moderately uncertain one. Diminishing returns set in fast: with base-2 logs, ratios of 1x, 4x, and 16x yield about 0.5, 1.2, and 2.0 bits, so each quadrupling of the ratio buys less than one additional bit.

Practical inputs required

Extract MMM marginals - for each channel $k$, compute $\sigma_k$ from the posterior samples of its ROI coefficient. Use the marginal standard deviation, not the standard error of the posterior mean.
Estimate experimental noise per design - estimate $\sigma_{\text{exp},k}^2(d)$ from a power analysis fit on the pre-period geo-week panel. This links design choices to achievable precision.
Compute the EIG grid - sweep across channels x designs, apply Eq. (6), and produce a ranked grid of $(k, d)$ pairs.
Constrain & select - apply geo non-overlap, budget caps, minimum duration, and non-interference. Solve the constrained selection problem; greedy is usually sufficient (see Submodularity).
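The first three steps above can be sketched as a small grid computation. This is an illustration under stated assumptions, not the platform's implementation: the channel names, variances, and designs are hypothetical, and the noise model is the planning approximation of Eq. (4).

```python
import math
from itertools import product

def exp_noise_var(s2_geo, n_geos, weeks, delta_spend):
    """Eq. (4): approximate experiment noise variance on the per-dollar scale."""
    return 2.0 * s2_geo / (n_geos * weeks) / delta_spend ** 2

def eig_nats(sigma_prior, noise_var):
    """Eq. (6): Gaussian expected information gain, in nats."""
    return 0.5 * math.log(1.0 + sigma_prior ** 2 / noise_var)

# Hypothetical inputs: marginal posterior sd and pre-period residual variance per channel
channels = {
    "ctv": {"sigma": 0.6, "s2_geo": 4.0},
    "search": {"sigma": 0.2, "s2_geo": 1.0},
}
# Hypothetical candidate designs (geos per arm, weeks, spend delta)
designs = [
    {"n_geos": 5, "weeks": 6, "delta_spend": 1.0},
    {"n_geos": 10, "weeks": 8, "delta_spend": 1.5},
]

grid = []
for (name, ch), d in product(channels.items(), designs):
    var_e = exp_noise_var(ch["s2_geo"], d["n_geos"], d["weeks"], d["delta_spend"])
    grid.append((eig_nats(ch["sigma"], var_e), name, d))

# Rank (channel, design) pairs by EIG, highest first
grid.sort(key=lambda row: row[0], reverse=True)
for eig, name, d in grid:
    print(f"{name:7s} geos={d['n_geos']:2d} weeks={d['weeks']} EIG={eig:.3f} nats")
```

The ranked grid is the input to the constrained selection step; feasibility constraints are deliberately left out of this sketch.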

For non-Gaussian posteriors - hierarchical models, non-conjugate priors, skewed contributions - a nested Monte Carlo estimator applies, at a cost of $O(N^2)$ likelihood evaluations for $N$ outer and $N$ inner samples. For prioritization purposes, the Gaussian approximation in Eq. (6) is almost always sufficient. Monte Carlo refinement is reserved for borderline channels.
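A rough sketch of a nested Monte Carlo EIG estimator, checked here against the Gaussian closed form so the result can be verified. The function signature, the sample sizes, and the choice to share one set of inner draws across outer draws are assumptions made for illustration; for a real non-Gaussian posterior you would swap in your own sampler and likelihood.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def nested_mc_eig(sample_prior, sample_outcome, likelihood,
                  n_outer=2000, n_inner=500, seed=0):
    """EIG ~= mean over (theta, y) of log p(y | theta) - log p(y), in nats."""
    rng = random.Random(seed)
    # Inner draws approximate the marginal p(y) = E_theta[p(y | theta)]
    inner = [sample_prior(rng) for _ in range(n_inner)]
    total = 0.0
    for _ in range(n_outer):
        theta = sample_prior(rng)            # outer draw from the prior
        y = sample_outcome(theta, rng)       # simulated experiment outcome
        marginal = sum(likelihood(y, t) for t in inner) / n_inner
        total += math.log(likelihood(y, theta)) - math.log(marginal)
    return total / n_outer

# Gaussian test case (assumed values) where Eq. (6) gives the exact answer
sigma_0, sigma_e = 1.0, 1.0
eig_mc = nested_mc_eig(
    sample_prior=lambda rng: rng.gauss(0.0, sigma_0),
    sample_outcome=lambda theta, rng: rng.gauss(theta, sigma_e),
    likelihood=lambda y, theta: normal_pdf(y, theta, sigma_e),
)
eig_exact = 0.5 * math.log(1.0 + sigma_0 ** 2 / sigma_e ** 2)
print(f"nested MC: {eig_mc:.3f} nats, closed form: {eig_exact:.3f} nats")
```

The estimator is biased for finite inner sample sizes, which is one reason to reserve it for borderline channels rather than the full grid.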

Expected Value of Information (EVOI)

EIG treats every bit of uncertainty reduction as equally valuable. But uncertainty about a channel receiving 1% of spend is far less costly than equivalent uncertainty about a channel receiving 40% of spend. EVOI prices uncertainty in dollars by accounting for the asymmetric cost of being wrong. This is preposterior analysis in the sense of Raiffa & Schlaifer (1961): value the experiment by the decisions it would improve, before running it.

Define the downstream action as a budget allocation $\mathbf{b} = (b_1, \ldots, b_K)$ with $\sum_k b_k = B$. The organization picks $\mathbf{b}^*$ based on its current beliefs and earns utility $U(\mathbf{b}, \boldsymbol{\theta}) = \sum_k \theta_k \cdot f(b_k)$, where $f(\cdot)$ is a (usually concave) channel response function.

$$\mathrm{EVOI}(k) \;=\; \mathrm{EU}_{\text{after},k} \;-\; \mathrm{EU}_{\text{now}}$$ $$\mathrm{EU}_{\text{after},k} \;=\; \mathbb{E}_{\hat{y}_k}\!\left[ \max_{\mathbf{b}}\, \mathbb{E}_{\boldsymbol{\theta} \mid \hat{y}_k}\!\big[U(\mathbf{b}, \boldsymbol{\theta})\big] \right]$$ (7)

Closed-form EVOI is generally hard, but a useful linearized approximation drops out when reallocations only happen if the experiment changes the ordering of channels - i.e., if the posterior mean crosses a decision threshold:

$$\mathrm{EVOI}(k) \;\approx\; B \cdot s_k \cdot \mathbb{E}\!\left[\,\big|\hat{\theta}_{k,\text{post}} - \hat{\theta}_{k,\text{prior}}\big|\,\right] \cdot \mathbb{P}(\text{decision flips})$$ (8)

Deep dive: The status and limits of the EVOI linearization in Eq. (8)

Status of Eq. (8): a heuristic linearization, not a derived bound. It approximates the value integral by assuming (i) utility is locally linear in the reallocated budget $B \cdot s_k$, (ii) reallocation only happens when the posterior mean crosses the decision threshold, and (iii) the size of the post-decision gain scales with the posterior-mean shift. Each assumption fails in identifiable cases (strongly concave response near saturation; multi-channel rebalancing where several thresholds interact), and the units are only consistent when $\hat\theta$ is a per-dollar ROI applied to the dollars at stake. The shipped implementation (planning/evoi.py) computes EVOI by Monte Carlo over posterior draws and the explicit decision rule instead of Eq. (8); use the formula for intuition and the simulation for numbers.

Plain English

EVOI is large when (i) the budget at stake is big, (ii) the channel takes a lot of that budget, (iii) the experiment is likely to substantially shift the posterior mean, and (iv) there is a real chance the shift will cross a decision threshold (e.g., flip from "underspend" to "overspend").

Structural drivers

High prior uncertainty ($\sigma_k$ large) - wide credible intervals mean large potential for belief revision, increasing EVOI.
High spend share ($s_k$ large) - channels receiving big budget fractions have higher stakes; errors cost more.
Decision threshold proximity - when the posterior mean is near a budget reallocation threshold (e.g., borderline ROI = 1.0), even small uncertainty reductions have high value.
Concave response (saturation) - when the response function is concave, correcting misallocation has compounding returns.

EVOI requires an explicit decision rule

EVOI is only as good as the assumed rule for translating posteriors into budgets. Make it explicit. A simple rule: reallocate proportional to posterior mean ROI, subject to floor/ceiling constraints. A richer rule: optimizer with saturation curves and channel-level minimums. Document whichever rule you use - the EVOI computation depends on it.
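To make the dependence on the decision rule concrete, here is a toy preposterior simulation of Eq. (7). It is not the shipped planning/evoi.py. It assumes a deliberately simple rule: a flexible budget goes entirely to channel $k$ or to an alternative with known ROI `r_alt`, whichever has the higher posterior mean, with linear utility. Under those assumptions a closed form exists, which the sketch uses as a cross-check. All names and numbers are hypothetical.

```python
import math
import random

def evoi_monte_carlo(mu_0, sigma_0, sigma_e, r_alt, budget, n_sims=50000, seed=0):
    """Eq. (7) by simulation under the all-or-nothing decision rule (an assumption)."""
    rng = random.Random(seed)
    var_post = 1.0 / (1.0 / sigma_0 ** 2 + 1.0 / sigma_e ** 2)
    eu_now = budget * max(mu_0, r_alt)             # best action under current beliefs
    eu_after = 0.0
    for _ in range(n_sims):
        theta = rng.gauss(mu_0, sigma_0)           # draw a possible truth
        y_hat = rng.gauss(theta, sigma_e)          # simulate the experiment readout
        mu_post = var_post * (mu_0 / sigma_0 ** 2 + y_hat / sigma_e ** 2)
        eu_after += budget * max(mu_post, r_alt)   # best action after the readout
    return eu_after / n_sims - eu_now

def evoi_closed_form(mu_0, sigma_0, sigma_e, r_alt, budget):
    """Same quantity analytically: the posterior mean is N(mu_0, s^2) before the test."""
    s = sigma_0 ** 2 / math.sqrt(sigma_0 ** 2 + sigma_e ** 2)
    z = (mu_0 - r_alt) / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    expected_max = r_alt + (mu_0 - r_alt) * cdf + s * pdf
    return budget * (expected_max - max(mu_0, r_alt))

args = dict(mu_0=1.1, sigma_0=0.3, sigma_e=0.3, r_alt=1.0, budget=1_000_000)
evoi_mc = evoi_monte_carlo(**args)
evoi_cf = evoi_closed_form(**args)
print(f"EVOI (simulation): {evoi_mc:,.0f}   EVOI (closed form): {evoi_cf:,.0f}")
```

Changing the rule, for example to proportional reallocation with floors and ceilings, changes the number; that is why the rule belongs in the documentation next to the EVOI estimate.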

The 2x2 Priority Map

Combine EIG and EVOI into a single priority score and a stakeholder-friendly visualization:

$$\mathrm{Score}(k) \;=\; \lambda \cdot \widetilde{\mathrm{EIG}}(k, d^*) \;+\; (1-\lambda) \cdot \widetilde{\mathrm{EVOI}}(k) \;-\; \gamma \cdot \mathrm{cost}(k, d^*) \;-\; \rho \cdot \mathrm{risk}(k)$$ (9)

where $\widetilde{\cdot}$ denotes min-max normalization to $[0, 1]$, $\lambda$ trades off epistemic vs. instrumental value, $\gamma$ penalizes expensive experiments, and $\rho$ penalizes operational risk.
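A small sketch of Eq. (9), assuming cost and risk have already been scaled to comparable units. The weights, channel names, and input values below are hypothetical.

```python
def min_max(values):
    """Min-max normalize a dict of raw values to [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:                      # degenerate case: all channels tie
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

def priority_scores(eig, evoi, cost, risk, lam=0.5, gamma=0.1, rho=0.1):
    """Eq. (9): weighted EIG and EVOI minus cost and risk penalties."""
    eig_n, evoi_n = min_max(eig), min_max(evoi)
    return {
        k: lam * eig_n[k] + (1.0 - lam) * evoi_n[k] - gamma * cost[k] - rho * risk[k]
        for k in eig
    }

# Hypothetical inputs: EIG in nats, EVOI in dollars, cost and risk pre-scaled to [0, 1]
eig = {"ctv": 1.1, "search": 0.4, "audio": 0.9}
evoi = {"ctv": 40_000, "search": 60_000, "audio": 5_000}
cost = {"ctv": 0.5, "search": 0.2, "audio": 0.3}
risk = {"ctv": 0.2, "search": 0.1, "audio": 0.4}

scores = priority_scores(eig, evoi, cost, risk)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['ctv', 'search', 'audio']
```

Min-max normalization makes the score relative to the current candidate set, so adding or removing a channel can reorder the others; keep the candidate set fixed when comparing cycles.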

Priority map (vertical axis: EIG; horizontal axis: EVOI)

Q1 - Highest priority (high EIG, high EVOI; uncertain, high-spend). Run these first. Both epistemic and instrumental value are maximized. These are the headline targets every cycle.
Q2 - Run for learning (high EIG, low EVOI; uncertain, low-spend). Useful for model calibration and building geo-lift infrastructure, but lower urgency. Schedule when capacity allows.
Q3 - Monitor, don't test (low EIG, high EVOI; well-known, high-spend). The MMM is already well-identified here. Re-test only if model specification changes or spend levels shift dramatically.
Q4 - Deprioritize (low EIG, low EVOI; well-known, low-spend). Skip experiments. Accept the MMM estimate. Revisit only if spend materially increases or the model changes.

Submodularity & Greedy Selection

A non-obvious but enormously useful fact: the EIG of a set of experiments is a submodular function of the set. Submodularity means diminishing returns - adding the same experiment to a small portfolio yields more information than adding it to a large one. Formally, for sets $A \subseteq A'$:

$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) \;\ge\; \mathrm{EIG}(A' \cup \{k\}) - \mathrm{EIG}(A')$$ (10)

Submodularity has a major practical payoff: greedy selection is provably near-optimal. The Nemhauser-Wolsey-Fisher (1978) result says greedy attains at least $(1 - 1/e) \approx 63\%$ of the optimum under cardinality constraints. Krause & Golovin (2014) survey this style of greedy information-gathering portfolio selection. The pseudocode is trivial:

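# Greedy portfolio construction (Python sketch; channels, C_ops, is_feasible, and score are supplied by the caller)
A = []
while len(A) < C_ops:
    # Candidates not yet selected that still satisfy the portfolio constraints
    feasible = [k for k in channels if k not in A and is_feasible(k, A)]
    if not feasible:
        break
    # Pick the channel with the largest marginal score given the current portfolio
    k_star = max(feasible, key=lambda k: score(k, A))
    A.append(k_star)
# When score(k, A) is the marginal EIG, submodularity gives EIG(A_greedy) ≥ (1 - 1/e) · EIG(A_optimal)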

Practical implication: after about three experiments per cycle, additional ones add diminishing returns. The cycle naturally caps the test program at a sensible level - so any one quarter's experimental budget should be spent on the top 2-3 priorities, not spread thin.

Deep dive: Proof that portfolio EIG is submodular

Proof of submodularity

Let $\theta$ be the vector of channel parameters (true ROIs), and let $Y_A = \{Y_k\}_{k \in A}$ denote the experimental outcomes for a set $A$ of channels. We write $\mathrm{EIG}(A) = I(\theta;\, Y_A)$ for the mutual information between the parameters and the outcomes of experiment set $A$. The claim is the diminishing returns property: for any $A \subseteq A'$ and any $k \notin A'$,

$$I(\theta;\, Y_k \mid Y_A) \;\ge\; I(\theta;\, Y_k \mid Y_{A'})$$ (10a)

Step 1 — marginal gain equals conditional MI. By the chain rule of mutual information, $I(\theta;\, Y_{A \cup \{k\}}) = I(\theta;\, Y_A) + I(\theta;\, Y_k \mid Y_A)$, so the marginal gain from adding $k$ to set $A$ is exactly the conditional mutual information:

$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) \;=\; I(\theta;\, Y_k \mid Y_A)$$ (10b)

Step 2 — the key inequality. Write $B = A' \setminus A$, so $Y_{A'} = (Y_A, Y_B)$. Apply the chain rule to $I(\theta,\, Y_k;\; Y_B \mid Y_A)$ in two ways:

$$I(\theta,\, Y_k;\; Y_B \mid Y_A) \;=\; \underbrace{I(\theta;\, Y_B \mid Y_A)}_{\text{term I}} + \underbrace{I(Y_k;\, Y_B \mid \theta,\, Y_A)}_{= \;0}$$ (10c)

The second term vanishes because experimental outcomes for distinct channels are conditionally independent given $\theta$: once we know the true parameters, the outcome of experiment $k$ carries no information about the outcomes of experiments $B$. The same joint quantity expands the other way as:

$$I(\theta,\, Y_k;\; Y_B \mid Y_A) \;=\; \underbrace{I(Y_k;\, Y_B \mid Y_A)}_{\ge\; 0} + I(\theta;\, Y_B \mid Y_A,\, Y_k)$$ (10d)

Equating (10c) and (10d) and dropping the non-negative $I(Y_k;\, Y_B \mid Y_A)$ term:

$$I(\theta;\, Y_B \mid Y_A) \;\ge\; I(\theta;\, Y_B \mid Y_A,\, Y_k)$$ (10e)

Step 3 — assemble the bound. Apply the chain rule once more to $I(\theta;\, Y_k,\, Y_B \mid Y_A)$:

$$I(\theta;\, Y_k \mid Y_A) - I(\theta;\, Y_k \mid Y_{A'}) \;=\; I(\theta;\, Y_B \mid Y_A) - I(\theta;\, Y_B \mid Y_A,\, Y_k) \;\stackrel{(10\mathrm{e})}{\ge}\; 0 \qquad \square$$ (10f)

Gaussian specialisation. In the conjugate Gaussian model, the marginal gain from adding channel $k$ to set $A$ takes the closed form $$I(\theta_k;\, Y_k \mid Y_A) \;=\; \tfrac{1}{2}\log\!\left(1 + \frac{\sigma_{k|A}^2}{\sigma_{\text{exp},k}^2(d)}\right)$$ where $\sigma_{k|A}^2$ is the posterior variance of $\theta_k$ after observing $Y_A$ and $\sigma_{\text{exp},k}^2(d)$ is the experiment noise variance from Eq. (4). Because each experiment in $A$ can only reduce $\sigma_{k|A}^2$, the marginal gain is non-increasing as $A$ grows, which is the diminishing-returns property made explicit. Under the Nemhauser-Wolsey-Fisher (1978) theorem, greedy selection therefore attains at least $(1-1/e) \approx 63\%$ of the optimal portfolio value.
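The Gaussian specialisation can be checked numerically for two correlated channels. This sketch assumes a bivariate Gaussian prior over CTV and OLV ROI with a hypothetical correlation of 0.6; all variances are illustrative.

```python
import math

def marginal_gain(var_k, noise_var):
    """Closed-form marginal gain: 0.5 * log(1 + current variance / experiment noise)."""
    return 0.5 * math.log(1.0 + var_k / noise_var)

def var_after_other_test(var_k, var_j, corr, noise_var_j):
    """Variance of theta_k after observing an experiment on a correlated channel j.

    Standard Gaussian conditioning: Var(theta_k | Y_j) = var_k - cov^2 / Var(Y_j).
    """
    cov = corr * math.sqrt(var_k * var_j)
    return var_k - cov ** 2 / (var_j + noise_var_j)

# Hypothetical prior variances, prior correlation, and experiment noise variance
var_ctv, var_olv, corr, noise = 0.36, 0.25, 0.6, 0.10

gain_alone = marginal_gain(var_ctv, noise)                  # CTV test added to an empty portfolio
var_ctv_given_olv = var_after_other_test(var_ctv, var_olv, corr, noise)
gain_after_olv = marginal_gain(var_ctv_given_olv, noise)    # CTV test added after an OLV test

print(f"gain alone: {gain_alone:.3f} nats, gain after OLV test: {gain_after_olv:.3f} nats")
```

With zero prior correlation the two gains coincide, so the diminishing returns here come entirely from information the OLV test already supplied about CTV.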

Experimental Design: From Channel Selection to Experimental Spec

Selecting which channel to test is a prioritization problem (EIG, EVOI, §3–4). Specifying how to run the test — which geos to treat, how much to spend, how long to run — is a distinct design problem, in the geo-experiment tradition of Vaver & Koehler (2011). A geo-level Bayesian MMM provides the inputs needed to answer every design question rigorously, replacing heuristic defaults with posterior-derived quantities.

Printable artifact

The outputs of this design section — estimand, geos, dose, duration, power (with the power-ceiling check), stopping rule, decision rule — are the fields of the Experiment Pre-Registration Memo, a fill-in one-pager signed before launch.

Quantities the MMM Provides

Quantity	Symbol	Use in test design
Per-geo channel posterior	$\beta_{k,g} \mid \mu_{k,g},\,\sigma_{k,g}^2$	Identifies geos with most prior uncertainty (treated-geo ranking)
Per-geo current spend	$x_{k,g}$	Anchors the saturation-curve evaluation point
Per-geo saturation curve	$S_k(\cdot;\,h_{k,g},\kappa_{k,g})$	Local slope determines the optimal spend increment
Per-geo residual variance	$\sigma_g^2$	Plugs directly into the standard error formula
Within-geo serial correlation	$\rho_g$	Determines how duration converts to precision
Geo random-effect posteriors	$\mathbf{u}_g$	Mahalanobis distance for matched-pair construction
Cross-geo posterior correlations	$\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})$	Spillover and contamination diagnostics

Step 1 · Treated-Geo Selection by Information Yield

Not all geos contribute equally to resolving channel uncertainty. The information yield $I_{k,g}$ scores each geo by how much a test there would compress the posterior on $\beta_{k,g}$:

$$I_{k,g} \;=\; \frac{\sigma_{k,g}^2 \cdot \bigl[S_k'(x_{k,g})\bigr]^2}{\sigma_g^2}$$

The numerator is the product of prior uncertainty (wide posterior = more to learn) and saturation slope squared (steep curve = spend increment translates to large signal). The denominator is residual noise. A geo saturated for channel $k$ has a small $S_k'(x_{k,g})$ and therefore low yield regardless of how uncertain the posterior is. Rank geos descending by $I_{k,g}$ and select the top $G$.

Greedy selection applies here too. By the same submodularity argument as §4, greedy selection of geos by $I_{k,g}$ achieves at least $(1-1/e) \approx 63\%$ of the optimal joint information gain. Run the greedy pass once; the ranking is cheap to compute from the fitted MMM.
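A short sketch of the geo ranking, taking the posterior sd, saturation slope, and residual sd per geo as given. The Atlanta and Phoenix slopes reuse the values from the worked example later on this page; every other number, and the Denver row, is a hypothetical input.

```python
def information_yield(sigma_kg, slope, sigma_g):
    """I_{k,g}: prior variance times squared saturation slope over residual variance."""
    return (sigma_kg ** 2) * (slope ** 2) / (sigma_g ** 2)

# Hypothetical per-geo inputs for one channel
geos = {
    "Atlanta":   {"sigma_kg": 0.5, "slope": 0.82, "sigma_g": 1.0},
    "Phoenix":   {"sigma_kg": 0.6, "slope": 0.51, "sigma_g": 1.0},
    "St. Louis": {"sigma_kg": 0.5, "slope": 0.55, "sigma_g": 1.0},
    "Denver":    {"sigma_kg": 0.9, "slope": 0.10, "sigma_g": 1.0},  # uncertain but saturated
}

ranked = sorted(geos, key=lambda g: information_yield(**geos[g]), reverse=True)
top_geos = ranked[:2]  # select the top G = 2
print(ranked)          # ['Atlanta', 'Phoenix', 'St. Louis', 'Denver']
```

Denver has the widest posterior in this toy panel but ranks last because its slope is near zero, which is the saturation effect described above.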

Step 2 · Matched-Pair Construction via Posterior Mahalanobis Distance

For each selected treatment geo $g$, choose a control geo $g'$ that minimises posterior Mahalanobis distance over the geo random effects $\mathbf{u}_g$:

$$d^2(g,\,g') \;=\; (\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})^\top \boldsymbol{\Sigma}_u^{-1}\,(\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})$$

where $\boldsymbol{\mu}_{u,g}$ is the posterior mean of geo $g$'s random-effect vector and $\boldsymbol{\Sigma}_u$ is the posterior covariance matrix of those effects. Matching on Mahalanobis distance over the full posterior — rather than on raw KPI correlation — accounts for parameter uncertainty and should reduce residual control-group variance (a heuristic expectation, not a measured benchmark — validate the gain on your own panel). The construction is Mahalanobis-metric matching (cf. Imbens & Rubin 2015, who match on raw covariates rather than posterior random effects). Exclude any control candidate with cross-geo posterior correlation $|\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})| > 0.5$ (spillover screen; see Step 6).
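A minimal sketch of the matching step, simplified to two-dimensional random-effect vectors so the covariance inverse can be written out by hand. The geo means, covariance, and correlations are hypothetical; a production version would use a general linear-algebra solve.

```python
def mahalanobis_sq(mu_a, mu_b, cov):
    """Squared Mahalanobis distance for 2-D vectors, using an explicit 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

def best_control(treated, candidates, means, cov, corr_with_treated, max_corr=0.5):
    """Nearest control by posterior Mahalanobis distance, after the spillover screen."""
    eligible = [g for g in candidates if abs(corr_with_treated[g]) <= max_corr]
    # min() raises ValueError if every candidate fails the screen
    return min(eligible, key=lambda g: mahalanobis_sq(means[treated], means[g], cov))

# Hypothetical posterior means of the geo random effects and their covariance
means = {
    "Atlanta": (0.20, 1.00),
    "Tampa": (0.30, 0.90),
    "St. Louis": (0.21, 1.00),
    "Portland": (1.00, 0.20),
}
cov = ((0.04, 0.0), (0.0, 0.04))
corr_with_atlanta = {"Tampa": 0.10, "St. Louis": 0.71, "Portland": 0.05}

control = best_control("Atlanta", ["Tampa", "St. Louis", "Portland"],
                       means, cov, corr_with_atlanta)
print(control)  # Tampa
```

St. Louis is the closest candidate in this toy data but is removed by the 0.5 correlation screen before distances are compared.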

Step 3 · Treatment Intensity from the Saturation Curve

The optimal spend increment for geo $g$ equates signal-to-noise across the panel:

$$\Delta\mathrm{spend}_g^* \;=\; \frac{\sigma_g}{S_k'(x_{k,g})} \cdot \sqrt{\frac{2\,D(T,\rho_g)}{GT} \cdot \frac{\eta}{1-\eta}} \cdot \frac{1}{\sigma_{k,g}}$$

where $\eta$ is target power, $G$ is the number of treatment geos, $T$ is planned duration, and $D(T,\rho_g)$ is the AR(1) design effect for a $T$-week mean, defined in Step 4 below. The factor $S_k'(x_{k,g})^{-1}$ is the key innovation: a geo near saturation (small slope) requires a proportionally larger spend increment to produce the same signal as a sub-saturated geo. In practice $\Delta\mathrm{spend}_g^*$ varies 3–5× across geos for the same channel; applying a uniform percentage uplift systematically under-powers some geos and wastes budget in others.
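The spend-increment formula can be transcribed directly. This is a sketch that implements the expression exactly as written in this step, with the AR(1) design effect from Step 4; the input values are hypothetical and the result is in whatever spend units the saturation slope is expressed in.

```python
import math

def ar1_design_effect(T, rho):
    """D(T, rho): variance inflation of a T-week mean under AR(1) correlation."""
    return 1.0 + 2.0 * sum((1.0 - k / T) * rho ** k for k in range(1, T))

def spend_increment(sigma_g, slope, sigma_kg, rho_g, n_geos, weeks, power):
    """Spend increment for one geo, per the Step 3 formula."""
    d_eff = ar1_design_effect(weeks, rho_g)
    scale = math.sqrt(2.0 * d_eff / (n_geos * weeks) * power / (1.0 - power))
    return sigma_g / slope * scale / sigma_kg

# Hypothetical inputs: identical geos except for the saturation slope
common = dict(sigma_g=1.0, sigma_kg=0.5, rho_g=0.4, n_geos=2, weeks=8, power=0.8)
inc_steep = spend_increment(slope=0.82, **common)
inc_flat = spend_increment(slope=0.51, **common)
print(f"steep slope: {inc_steep:.2f}, flat slope: {inc_flat:.2f}")
```

Holding everything else fixed, the increment scales with the inverse of the slope, so the flatter geo needs 0.82 / 0.51, roughly 1.6 times, the increment of the steeper one.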

Step 4 · Duration from Serial Correlation

Weekly KPI within a geo is serially correlated with AR(1) coefficient $\rho_g$, estimated from the MMM residuals. Ignoring this serial correlation is the classic way to overstate the precision of difference-in-differences estimates (Bertrand, Duflo & Mullainathan 2004). Under AR(1), the variance of a $T$-week mean is inflated by the design effect

$$D(T,\rho_g) \;=\; 1 + 2\sum_{k=1}^{T-1}\Bigl(1-\tfrac{k}{T}\Bigr)\rho_g^{\,k} \;\;\xrightarrow[T\to\infty]{}\;\; \frac{1+\rho_g}{1-\rho_g},$$

so the corrected standard error for the difference-in-differences estimator is:

$$\mathrm{SE}(\hat\tau) \;=\; \sigma_y\,\sqrt{\tfrac{1}{n_t} + \tfrac{1}{n_c}} \cdot \sqrt{\frac{D(T,\rho_g)}{T}}$$

When $\rho_g = 0$ this reduces to the familiar $\sigma/\sqrt{T}$ form. A common mistake — and one an earlier version of this page made — is to use the exchangeable design effect $1+(T-1)\rho$ here. That formula is correct for equicorrelated (cluster-randomized) observations, but under AR(1) the correlation decays with lag, so the exchangeable factor badly overstates the variance and hence the required duration: at $\rho_g=0.4$, $T=8$, the exchangeable factor is $3.8$ while the AR(1) design effect is $\approx 2.1$ — nearly double the implied test length for no reason. The minimum duration $T^*$ for target power $1-\beta$ solves:

$$T^* \;=\; \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\;\sigma_y^2\;(n_t^{-1}+n_c^{-1}) \,D(T^*,\rho_g)}{\Delta y^2}$$

The implicit equation is solved by one-dimensional search. Because $D(T,\rho_g)$ saturates toward $(1+\rho_g)/(1-\rho_g)$, the marginal information from each additional week falls off quickly — for $\rho_g \in [0.2, 0.5]$ the formula implies most designs stop earning meaningful EIG within roughly 6–10 weeks (a consequence of this math, not an empirical benchmark; the calculator recomputes it for your inputs). Running longer accumulates cost with minimal additional information.
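A sketch of the design effect and the one-dimensional search for $T^*$, restricted to whole weeks. The helper names and the example inputs are assumptions; the search returns the smallest integer $T$ that satisfies the inequality form of the equation above.

```python
from statistics import NormalDist

def ar1_design_effect(T, rho):
    """D(T, rho) = 1 + 2 * sum_{k=1}^{T-1} (1 - k/T) * rho^k."""
    return 1.0 + 2.0 * sum((1.0 - k / T) * rho ** k for k in range(1, T))

def min_duration(delta_y, sigma_y, n_t, n_c, rho, alpha=0.05, power=0.8, max_weeks=52):
    """Smallest whole-week T with T >= base * D(T, rho); None if none within max_weeks."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    base = z ** 2 * sigma_y ** 2 * (1.0 / n_t + 1.0 / n_c) / delta_y ** 2
    for T in range(1, max_weeks + 1):
        if T >= base * ar1_design_effect(T, rho):
            return T
    return None

# The comparison quoted in the text: rho = 0.4, T = 8
print(round(ar1_design_effect(8, 0.4), 2))   # 2.06 (AR(1) design effect)
print(round(1 + (8 - 1) * 0.4, 2))           # 3.8  (exchangeable factor)

# Hypothetical design: 5 geos per arm, KPI sd 1.5, target lift 1.0
t_ar1 = min_duration(delta_y=1.0, sigma_y=1.5, n_t=5, n_c=5, rho=0.4)
t_iid = min_duration(delta_y=1.0, sigma_y=1.5, n_t=5, n_c=5, rho=0.0)
print(t_ar1, t_iid)  # 16 8
```

In this toy design, ignoring serial correlation would suggest 8 weeks where the AR(1) correction asks for 16, which is the overstated-precision problem the step warns about.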

Delta-method power ceiling. The ROI posterior uncertainty $\sigma_{k,g}$ also enters through the signal: $\mathrm{Var}(\Delta y) \geq \Delta b^2\,\sigma_\theta^2$. This term is independent of $T$, so it sets a variance floor and therefore an asymptotic power ceiling: no duration is long enough to recover power lost to a wide ROI posterior. The design is only feasible if the power ceiling, $\lim_{T\to\infty} \mathrm{Power}(T)$, exceeds the target.

Step 5 · Pre-Experiment Simulation

Before committing budget, validate the proposed design by using the fitted MMM as a forward simulator:

Sample $\tilde\theta$ from the posterior $p(\theta \mid \mathcal{D})$.
Generate simulated KPI for each geo under the proposed spend schedule, adding AR(1) noise with innovation variance $\sigma_g^2$ and coefficient $\rho_g$ (simulate the process itself rather than scaling i.i.d. noise by a design factor — the simulation should carry the true autocorrelation).
Fit the planned estimator (DiD, SCM, etc.) to the simulated data and record whether it reaches significance at $\alpha$.
Repeat 500–2,000 times; the fraction of significant replicates is the simulated power.

Simulated power catches issues — skewed KPI distributions, heterogeneous geo variance, adstock contamination — that closed-form approximations miss. If simulated power falls short, adjust $\Delta\mathrm{spend}_g^*$, add geos, or extend duration and re-simulate.
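The four simulation steps can be sketched in a stripped-down form. This toy version stands in for the fitted MMM with strong simplifications, all of them assumptions: the lift is drawn from a Gaussian "posterior", every geo shares one AR(1) residual process, the pre-period baseline is taken as already removed, and the estimator is a treated-minus-control mean contrast tested with the Step 4 standard error.

```python
import math
import random
from statistics import NormalDist

def ar1_design_effect(T, rho):
    return 1.0 + 2.0 * sum((1.0 - k / T) * rho ** k for k in range(1, T))

def simulate_geo_mean(T, rho, innov_sd, rng):
    """Mean of T weeks of a stationary AR(1) residual series (simulated, not rescaled i.i.d.)."""
    y = rng.gauss(0.0, innov_sd / math.sqrt(1.0 - rho ** 2))  # stationary starting value
    total = y
    for _ in range(T - 1):
        y = rho * y + rng.gauss(0.0, innov_sd)
        total += y
    return total / T

def simulated_power(lift_mean, lift_sd, n_t, n_c, T, rho, innov_sd,
                    alpha=0.05, n_reps=1000, seed=0):
    rng = random.Random(seed)
    sigma_y = innov_sd / math.sqrt(1.0 - rho ** 2)            # stationary KPI sd
    se = sigma_y * math.sqrt(1.0 / n_t + 1.0 / n_c) * math.sqrt(ar1_design_effect(T, rho) / T)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(n_reps):
        lift = rng.gauss(lift_mean, lift_sd)                  # step 1: draw from the "posterior"
        treat = [lift + simulate_geo_mean(T, rho, innov_sd, rng) for _ in range(n_t)]  # step 2
        ctrl = [simulate_geo_mean(T, rho, innov_sd, rng) for _ in range(n_c)]
        estimate = sum(treat) / n_t - sum(ctrl) / n_c         # step 3: planned estimator
        if abs(estimate) / se > z_crit:
            hits += 1
    return hits / n_reps                                      # step 4: share of significant runs

power_alt = simulated_power(1.0, 0.1, n_t=10, n_c=10, T=8, rho=0.4, innov_sd=1.0)
power_null = simulated_power(0.0, 0.0, n_t=10, n_c=10, T=8, rho=0.4, innov_sd=1.0)
print(f"power at lift 1.0: {power_alt:.2f}, false-positive rate at lift 0: {power_null:.2f}")
```

Running the null case alongside the target lift is a cheap sanity check: if the false-positive rate is far from $\alpha$, the estimator's standard error is miscalibrated for the simulated noise process.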

Step 6 · Spillover and Contamination Diagnostics

Cross-geo spillover (national TV halo, retargeting bleed across DMAs) deflates the estimated lift by partially treating the control. Diagnose contamination with the cross-geo posterior correlation:

$$r_{g,g'} \;=\; \mathrm{Cor}\!\bigl(\beta_{k,g},\,\beta_{k,g'}\bigr)$$

Flag any control candidate with $|r_{g,g'}| > 0.5$ and exclude it from the donor pool. Adstock carryover introduces time-domain contamination in the first 1–2 weeks; include a wash-in buffer or model carryover explicitly in the geo-level analysis. Both diagnostics are computed from the fitted MMM posterior — no additional data collection is required.

Worked Example — CTV Campaign

Five candidate geos were scored for a Connected-TV channel. The table summarises the design output of Steps 1–6.

Geo	Role	$I_{k,g}$ rank	Diagnostic ($S'$ for treated, $d^2$ for controls, $r_{g,g'}$ for excluded)	$\Delta\mathrm{spend}_g^*$	$T^*$ (weeks)
Atlanta	Treatment	1st	0.82 (sub-saturated)	+18%	7
Phoenix	Treatment	2nd	0.51 (mid-saturated)	+31%	9
Tampa	Control (Atlanta)	—	$d^2 = 0.9$	0%	—
Portland	Control (Phoenix)	—	$d^2 = 1.4$	0%	—
St. Louis	Excluded	3rd	$r_{g,g'} = 0.71$	—	—

Atlanta's lower $\Delta\mathrm{spend}^*$ (18% vs 31%) reflects its steeper saturation slope: the same ROI uncertainty is resolved more cheaply where the response curve is still responsive. St. Louis was excluded despite ranking 3rd by $I_{k,g}$ because its cross-geo posterior correlation with Atlanta ($r = 0.71$) exceeded the spillover threshold.

Channel × geo generalisation. The same greedy submodular selection extends to portfolios of (channel, geo) pairs. Score every pair by $I_{k,g}$, apply the Mahalanobis exclusion and spillover screen, then select greedily up to the experiment budget. The $(1-1/e)$ approximation bound from §4 applies to the joint selection problem without modification.

Calibration: Using Experiments to Anchor the Next MMM

Calibration is the process of using experimental causal estimates to anchor the MMM's channel parameters so that model-implied ROIs are consistent with ground truth. This can range from soft regularization (Bayesian priors informed by experiments) to hard constraints (fixing parameters at experimental point estimates - which we generally avoid). Three mechanisms cover the main use cases.

Before calibration

Parameters identified by observational variation alone
Posteriors driven mostly by structural assumptions
Attribution may reflect media collinearity, not causality
ROI rankings unstable across model specifications

After calibration

Experimental likelihoods constrain key channel parameters
Posteriors tighter on tested channels; partial pooling propagates
Attribution anchored to causal reality for tested channels
Remaining uncertainty quantified, not hidden

M1 - Soft prior

Last quarter's posterior becomes this quarter's prior

The most principled approach: use the experimental posterior as the prior on the corresponding MMM channel parameter. During fitting, the MMM likelihood still determines the channels without experiments, while the informed prior anchors the calibrated channel near its experimentally identified value.

M2 - Likelihood augmentation

Treat the experiment as one more data point

Add the experimental estimate as an additional term in the joint likelihood. Cleaner when the experiment overlaps the MMM training window. Requires a mapping $g(\theta_k)$ from MMM parameters to the implied geo-lift effect size.

M3 - Hierarchical pooling

Tested channels pull untested siblings via shared structure

Information from a tested channel propagates to untested channels via a shared category-level hyperparameter. An experiment on CTV updates $\mu_{\text{video}}$, which partially informs the OLV estimate.

Mechanism 1 - Soft calibration via Bayesian priors

The experiment on channel $k$ produces $p(\tau_k \mid \hat{y}_k)$. Pass that distribution as the prior on the corresponding MMM parameter:

$$p(\tau_k \mid \hat{y}_k) \;\propto\; p(\hat{y}_k \mid \tau_k) \cdot p(\tau_k)$$ $$p(\theta_k) \;\leftarrow\; p(\tau_k \mid \hat{y}_k) \quad \text{(with appropriate scale transform)}$$ $$p(\boldsymbol{\theta} \mid y, X) \;\propto\; p(y \mid X, \boldsymbol{\theta}) \cdot p(\theta_k \mid \hat{y}_k) \cdot p(\boldsymbol{\theta}_{-k})$$ (11)

The scale transform matters. A geo-lift $\hat{\tau}$ is in outcome units per spend-dollar in the test window, while an MMM coefficient may be in standardized units. Convert carefully and document the conversion rule in the calibration spec.
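As one illustration of a conversion rule, the sketch below assumes a linear MMM fitted on z-scored weekly spend and z-scored weekly outcome. Under that assumption a per-dollar lift converts to coefficient units by a pure rescaling; saturating or log-transformed specifications would need a different rule. The class and function names are hypothetical and are not part of the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalPrior:
    """Gaussian summary of a belief about one channel parameter."""
    mean: float
    sd: float


def lift_to_standardized_prior(
    tau_mean: float,    # experimental lift, outcome units per spend dollar
    tau_sd: float,      # posterior sd (or standard error) of that lift
    spend_sd: float,    # sd of weekly spend used to z-score the MMM input
    outcome_sd: float,  # sd of weekly outcome used to z-score the MMM target
) -> NormalPrior:
    """Rescale a per-dollar lift into standardized-coefficient units.

    Assumes a linear MMM on z-scored spend and outcome, so that
    beta_std = tau * spend_sd / outcome_sd.
    """
    if spend_sd <= 0 or outcome_sd <= 0 or tau_sd <= 0:
        raise ValueError("standard deviations must be positive")
    scale = spend_sd / outcome_sd
    return NormalPrior(mean=tau_mean * scale, sd=tau_sd * scale)


# Example: a lift of 2.0 (sd 0.5) outcome units per dollar.
prior = lift_to_standardized_prior(2.0, 0.5, spend_sd=10_000.0, outcome_sd=50_000.0)
print(prior)  # NormalPrior(mean=0.4, sd=0.1)
```

Writing the rule as a small, testable function is one way to keep the conversion documented alongside the calibration spec.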

Mechanism 2 - Likelihood augmentation

$$\log p(y, \hat{y}_k \mid X, \boldsymbol{\theta}) \;=\; \log p(y \mid X, \boldsymbol{\theta}) \;+\; \log p(\hat{y}_k \mid \theta_k)$$ $$p(\hat{y}_k \mid \theta_k) \;=\; \mathcal{N}\!\left(g(\theta_k),\; \sigma_{\text{exp},k}^2\right)$$ (12)

where $g(\theta_k)$ maps MMM channel parameters to the implied geo-lift effect size - accounting for the spend delta in the test, the saturation curve evaluated at the test spend level, and any adstock effects in the experiment window. Constructing $g(\cdot)$ correctly is the hardest part of this mechanism; a misspecified $g(\cdot)$ introduces systematic bias into the calibrated parameter.
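A minimal sketch of one possible $g(\theta_k)$ and the extra likelihood term in Eq. (12). It assumes a Hill saturation curve, constant weekly spend in the control and treatment arms, and no adstock carryover across the window edges; a production mapping would need to handle all three. The parameter names (`beta`, `half_sat`, `shape`) are illustrative, not the platform's.

```python
import math


def hill(spend: float, half_sat: float, shape: float) -> float:
    """Hill saturation curve in [0, 1); half_sat is the spend giving 0.5."""
    if spend <= 0:
        return 0.0
    return spend**shape / (spend**shape + half_sat**shape)


def implied_lift(beta: float, half_sat: float, shape: float,
                 base_spend: float, test_spend: float, n_weeks: int) -> float:
    """g(theta_k): incremental outcome the MMM implies for the test.

    Assumes weekly spend is held at base_spend (control) vs. test_spend
    (treatment) and ignores adstock carryover.
    """
    weekly_delta = beta * (hill(test_spend, half_sat, shape)
                           - hill(base_spend, half_sat, shape))
    return n_weeks * weekly_delta


def experiment_log_lik(observed_lift: float, sigma_exp: float,
                       model_lift: float) -> float:
    """Gaussian log-density of the experimental readout given g(theta_k)."""
    z = (observed_lift - model_lift) / sigma_exp
    return -0.5 * z * z - math.log(sigma_exp) - 0.5 * math.log(2 * math.pi)


def augmented_log_lik(mmm_log_lik: float, observed_lift: float,
                      sigma_exp: float, model_lift: float) -> float:
    """Eq. (12): MMM log-likelihood plus the experiment term."""
    return mmm_log_lik + experiment_log_lik(observed_lift, sigma_exp, model_lift)


g = implied_lift(beta=1000.0, half_sat=50_000.0, shape=1.0,
                 base_spend=0.0, test_spend=50_000.0, n_weeks=4)
print(g)  # 2000.0
```

Inside a sampler, `implied_lift` would be evaluated at each draw of the channel parameters, so the experiment term penalizes draws whose implied lift sits far from the readout.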

Mechanism 3 - Hierarchical pooling

$$\theta_k \mid \mu_{\text{cat}}, \sigma_{\text{cat}} \;\sim\; \mathcal{N}\!\left(\mu_{\text{cat}[k]},\; \sigma_{\text{cat}[k]}^2\right)$$ (13)

where $\mathrm{cat}[k]$ is the media category of channel $k$ (e.g., video, search, social). An experiment on CTV updates $\mu_{\text{video}}$, which partially informs the OLV estimate. Pooling strength is controlled by $\sigma_{\text{cat}}$ via a hyperprior - tighter $\sigma_{\text{cat}}$ pools more strongly.
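To make the propagation concrete, the sketch below uses the closed-form normal-normal case: each channel estimate is treated as Gaussian, $\sigma_{\text{cat}}$ is held fixed rather than given a hyperprior, and $\mu_{\text{cat}}$ has a flat prior. These are simplifying assumptions for illustration; the full model would sample $\sigma_{\text{cat}}$. The function name and the input numbers are hypothetical.

```python
from typing import Dict, Tuple


def pooled_estimates(
    estimates: Dict[str, Tuple[float, float]],  # channel -> (mean, sd)
    sigma_cat: float,
) -> Tuple[float, Dict[str, float]]:
    """Normal-normal partial pooling within one media category.

    Each channel estimate is treated as N(theta_k, sd_k^2) with
    theta_k ~ N(mu_cat, sigma_cat^2), sigma_cat known, flat prior on mu_cat.
    Returns the posterior mean of mu_cat and the shrunken channel means.
    """
    weights = {k: 1.0 / (sd**2 + sigma_cat**2) for k, (_, sd) in estimates.items()}
    total = sum(weights.values())
    mu_cat = sum(weights[k] * m for k, (m, _) in estimates.items()) / total
    shrunk = {}
    for k, (m, sd) in estimates.items():
        pull = sd**2 / (sd**2 + sigma_cat**2)  # share of weight placed on mu_cat
        shrunk[k] = (1.0 - pull) * m + pull * mu_cat
    return mu_cat, shrunk


# Before any experiment: both video channels are equally uncertain.
mu_before, before = pooled_estimates({"ctv": (1.0, 1.0), "olv": (1.0, 1.0)}, 0.5)

# After a CTV experiment: CTV is now precisely estimated at a higher value.
mu_after, after = pooled_estimates({"ctv": (2.0, 0.2), "olv": (1.0, 1.0)}, 0.5)

print(before["olv"], after["olv"])  # the OLV estimate moves toward the CTV result
```

With a smaller `sigma_cat` the `pull` term grows, so the untested channel moves further toward the category mean.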

Calibration validity checks

Scale consistency - verify that the experimental $\hat{\tau}_k$ and the MMM-implied effect use the same outcome variable and time aggregation.
Coverage check - confirm that the MMM's pre-calibration 95% credible interval for channel $k$ covers the experimental point estimate. If not, diagnose model misspecification before calibrating.
Quality gate - only use experiments that pass pre-specified quality checks (balance test, pre-period MAPE, sufficient MDE, no contamination evidence) for calibration inputs.
Attribution reconciliation - after refitting, confirm total attributed conversions still match observed outcomes - calibration should not break aggregate coherence.
Predictive validation - hold out a test period and compare pre- vs. post-calibration MAPE. Calibration should improve predictive accuracy or at minimum not degrade it.
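Two of the checks above reduce to simple arithmetic and can be sketched as follows. The coverage check assumes the MMM interval is summarized by a Gaussian mean and standard deviation; with posterior draws you would use empirical quantiles instead. Function names are illustrative.

```python
from statistics import NormalDist
from typing import Sequence


def covers(mmm_mean: float, mmm_sd: float, exp_estimate: float,
           level: float = 0.95) -> bool:
    """True if the central `level` interval of a Gaussian MMM summary
    contains the experimental point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return abs(exp_estimate - mmm_mean) <= z * mmm_sd


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error; actual values must be non-zero."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def calibration_did_not_degrade(actual: Sequence[float],
                                pre_pred: Sequence[float],
                                post_pred: Sequence[float],
                                tol: float = 0.0) -> bool:
    """Predictive validation: post-calibration holdout MAPE is no worse
    than pre-calibration MAPE (plus an optional tolerance)."""
    return mape(actual, post_pred) <= mape(actual, pre_pred) + tol
```

For example, `covers(1.0, 0.5, 1.9)` passes while `covers(1.0, 0.5, 2.1)` fails, which would send the channel to diagnosis before calibration.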

When experiment and MMM disagree

If the experimental estimate falls outside the MMM posterior's plausible range, do not automatically trust either side. Diagnose first: (1) Did the experiment have sufficient power? (2) Was there contamination or geo spillover? (3) Is the MMM response curve evaluated at the correct spend level? (4) Are there temporal confounders in the experiment window? Work through this checklist before resolving the conflict. The disagreement itself is documented evidence of model fragility - which is information you want.

The Virtuous Cycle: Six-Step Orbit

The full adaptive loop treats MMM calibration as a sequential Bayesian inference problem across measurement cycles. Each cycle has six steps - three modeling steps that happen against historical data, and three field steps that happen in the market.

Fit baseline MMM

Run the Bayesian MMM on historical data with weakly informative priors. Extract posterior $p(\boldsymbol{\theta})$ over all channel ROI parameters. Flag channels with $\sigma_k$ above threshold.

Score EIG & EVOI

Use the T₀ posterior as the prior. Estimate $\sigma_{\text{exp}}$ per channel given geo footprint. Produce the priority grid and select the top-K portfolio subject to operational constraints.

Run experiments (pre-registered)

Execute the selected geo-lift / matched-market tests with pre-specified designs. Log all deviations. Apply ITT analysis by default. Produce $p(\tau_k \mid \hat{y}_k)$ per tested channel.

Calibrate the next MMM

Refit via informed priors (M1) or augmented likelihood (M2). Document the degree of belief revision: pre- vs post-calibration mean and width per channel.

Allocate from the calibrated posterior

Make budget decisions from the calibrated MMM. Tag each line as experiment-backed or model-only. Report confidence tiers - distinguish evidence quality from point estimates.

Re-score for next cycle

Calibrated channels now have tighter posteriors - their EIG and EVOI drop. Recompute the priority grid with updated beliefs. Previously deprioritized channels may now rise. Begin again.
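The drop in EIG after calibration can be illustrated under a linear-Gaussian simplification, where the expected information gain of an experiment with readout noise $\sigma_{\text{exp}}$ on a channel with current posterior standard deviation $\sigma_k$ has the closed form $\tfrac{1}{2}\ln(1 + \sigma_k^2 / \sigma_{\text{exp}}^2)$ nats. The sketch below takes that form as an assumption; the platform's scoring may use a different estimator, and the function names are hypothetical.

```python
import math


def gaussian_eig(prior_sd: float, exp_sd: float) -> float:
    """Expected information gain (nats) of a Gaussian readout with noise
    exp_sd about a parameter with Gaussian prior sd prior_sd."""
    return 0.5 * math.log1p((prior_sd / exp_sd) ** 2)


def posterior_sd(prior_sd: float, exp_sd: float) -> float:
    """Posterior sd after a conjugate Gaussian update."""
    return (prior_sd**-2 + exp_sd**-2) ** -0.5


# Cycle 1: wide MMM posterior, experiment with sigma_exp = 0.5.
eig_first = gaussian_eig(prior_sd=1.0, exp_sd=0.5)

# Cycle 2: the same experiment repeated on the now-calibrated channel.
calibrated_sd = posterior_sd(prior_sd=1.0, exp_sd=0.5)
eig_repeat = gaussian_eig(prior_sd=calibrated_sd, exp_sd=0.5)

print(eig_first > eig_repeat)  # True: the repeat experiment is worth less
```

Because the repeat experiment scores lower, a channel that was previously ranked below it can move up the priority grid without its own uncertainty changing.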

The single-sentence summary

The MMM tells us where to look; the experiments tell us what's actually there; the next MMM bakes in what we learned. Each loop tightens the parts of the picture that matter most for budget decisions.

Information decay & re-experimentation triggers

Experimental calibration has a shelf life. An experiment from 18 months ago, run under different competitive conditions, spend levels, or creative strategies, may no longer accurately represent current channel effectiveness. Model an exponential decay of experimental information over time:

$$\sigma_{k,\text{eff}}^2(t) \;=\; \sigma_{k,\text{post}}^2 \cdot \exp(\lambda_{\text{decay}} \cdot t)$$ (14)

where $t$ is the time elapsed since the experiment's readout and $\lambda_{\text{decay}} > 0$ is calibrated from observed year-over-year MMM coefficient stability. As $\sigma_{k,\text{eff}}^2$ grows, EIG and EVOI for re-experimentation rise again - creating a principled re-experimentation schedule rather than relying on calendar-based intuitions. Decay rates differ by channel: fast-moving digital decays in 6-12 months; stable broadcast can hold 18-24 months.
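Eq. (14) can be evaluated directly, and inverted to find when the effective variance crosses a chosen threshold. In the sketch below, time is measured in months and the decay rate is set so that variance doubles every 12 months; both choices are illustrative assumptions, not calibrated values.

```python
import math


def effective_variance(post_var: float, decay_rate: float, t: float) -> float:
    """Eq. (14): posterior variance inflated by exp(decay_rate * t)."""
    return post_var * math.exp(decay_rate * t)


def time_until(post_var: float, decay_rate: float, var_threshold: float) -> float:
    """Time at which the effective variance first reaches var_threshold."""
    if decay_rate <= 0:
        raise ValueError("decay_rate must be positive")
    if var_threshold <= post_var:
        return 0.0
    return math.log(var_threshold / post_var) / decay_rate


# Illustrative rate: variance doubles every 12 months.
rate = math.log(2.0) / 12.0
var_after_year = effective_variance(0.04, rate, 12.0)    # about 0.08
months_to_4x = time_until(0.04, rate, 0.16)              # about 24 months
```

Setting `var_threshold` to the variance at which EIG or EVOI crosses its floor turns the decay model into a re-experimentation date per channel.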

Adaptive stopping criteria

A channel exits the active experimentation pool when all three conditions hold:

$$\mathrm{EIG}(k, d^*) < \varepsilon_{\text{EIG}} \quad \text{AND} \quad \mathrm{EVOI}(k) < \varepsilon_{\text{EVOI}} \quad \text{AND} \quad \mathrm{age}(\text{last\_exp}_k) < T_{\text{refresh}}$$ (15)

$T_{\text{refresh}}$ depends on media market dynamics, creative/targeting shifts, and competitive changes. Typical values: 12-24 months for stable channels, 6-12 months for fast-moving digital surfaces.
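The exit rule in Eq. (15) is a three-way conjunction, sketched below with hypothetical field names and thresholds. The age condition is what lets a channel re-enter the pool once its last experiment is older than the refresh window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelStatus:
    eig: float                           # EIG of the best available design
    evoi: float                          # expected value of information
    months_since_last_experiment: float


def exits_pool(status: ChannelStatus, eig_floor: float,
               evoi_floor: float, refresh_months: float) -> bool:
    """Eq. (15): exit only if all three conditions hold."""
    return (status.eig < eig_floor
            and status.evoi < evoi_floor
            and status.months_since_last_experiment < refresh_months)


recent = ChannelStatus(eig=0.02, evoi=500.0, months_since_last_experiment=4.0)
stale = ChannelStatus(eig=0.02, evoi=500.0, months_since_last_experiment=14.0)

print(exits_pool(recent, eig_floor=0.05, evoi_floor=1000.0, refresh_months=12.0))  # True
print(exits_pool(stale, eig_floor=0.05, evoi_floor=1000.0, refresh_months=12.0))   # False
```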

Why It Compounds

The point of running this loop is not any single experiment - it is the compounding of belief updates over successive cycles. A useful way to see the value is to plot the trajectories of four quantities that each cycle is trying to move:

MMM posterior precision - the 95% credible interval width on each channel's ROI. As experiments calibrate channels, widths contract.
Budget allocation - the share of total spend going to each channel. As ROIs become better-identified, allocations migrate toward the (unknown but increasingly well-estimated) optimum.
Misallocation cost - the gap between current spend and the spend that would maximize portfolio outcome under the current posterior. The running tab of wasted dollars per week.
Decision efficiency - the fraction of the theoretically optimal portfolio return that is actually captured, averaged over the current posterior. Rises as $\sigma_k$ shrinks and allocation converges to the optimum.

For a representative five-channel portfolio over four quarterly cycles, the loop produces the kind of trajectory below. The exact magnitudes depend on starting MMM uncertainty and spend asymmetry, but the shape - sharp early gains followed by diminishing returns - is structural.

−92% - weekly misallocation cost over 4 cycles (calibrated path vs. baseline)

+25pp - decision efficiency gain over the same window (% of optimal return captured)

Experiments per cycle - submodularity caps useful additions

New measurement contracts required - same vendor stack, better routing

How to read the four panels

Posterior contraction is the epistemic outcome — what we know better. Allocation migration is the operational outcome — what we do differently as a result. Misallocation cost is the economic outcome — what poor knowledge was costing us. Decision efficiency is the strategic outcome — the share of the theoretically optimal return we are actually capturing. A defensible measurement program connects all four and reports them quarterly.

Frequentist Tools, Bayesian Framing

Most geo-lift estimators in active production use are frequentist by construction: difference-in-differences with two-way fixed effects, synthetic control, augmented synthetic control, and time-based regression. Stakeholders speak that language. Pre-registration documents demand it. None of this is in tension with the Bayesian framework above.

The mental model

Run frequentist estimators because they are well-understood, defensible, and pre-registrable. Interpret their outputs as Gaussian likelihoods feeding into the conjugate update of Eq. (2). Stakeholders see CIs and p-values; the MMM team sees a precision-weighted posterior. Both are correct views of the same numbers.

The Bridge Equation

A frequentist point estimate $\hat{\beta}_k$ paired with a standard error $\mathrm{SE}_k$ is - under mild assumptions - a Gaussian likelihood. That likelihood plugs straight into the conjugate update alongside the MMM prior:

$$\hat{\beta}_k \mid \theta_k \;\sim\; \mathcal{N}(\theta_k,\; \mathrm{SE}_k^2) \;\;\Longrightarrow\;\; \theta_k \mid \hat{\beta}_k \;\sim\; \mathcal{N}(\mu_{\text{post}},\, \sigma_{\text{post}}^2)$$ $$\sigma_{\text{post}}^{-2} \;=\; \sigma_0^{-2} + \mathrm{SE}_k^{-2}, \qquad \mu_{\text{post}} \;=\; \sigma_{\text{post}}^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{\beta}_k}{\mathrm{SE}_k^2}\right)$$ (16)

The translation table is short: estimator output $\hat{\beta}_k$ becomes the likelihood mean; $\mathrm{SE}_k^2$ becomes the likelihood variance $\sigma_e^2$; the MDE relates to the achievable $\sigma_{\text{exp}}$ through the design's power calculation. Every estimator the measurement team already runs - DiD, SCM, ASC, time-based regression, BSTS / CausalImpact - feeds the same loop without anyone switching teams.
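Eq. (16) is a few lines of arithmetic. The sketch below applies it to a single channel, taking the MMM posterior as the prior and a frequentist point estimate with its standard error as the likelihood. The function name and the example numbers are illustrative.

```python
from typing import Tuple


def conjugate_update(prior_mean: float, prior_sd: float,
                     estimate: float, std_error: float) -> Tuple[float, float]:
    """Eq. (16): precision-weighted Gaussian update.

    Returns (posterior mean, posterior sd).
    """
    if prior_sd <= 0 or std_error <= 0:
        raise ValueError("prior_sd and std_error must be positive")
    prior_prec = prior_sd**-2
    lik_prec = std_error**-2
    post_var = 1.0 / (prior_prec + lik_prec)
    post_mean = post_var * (prior_mean * prior_prec + estimate * lik_prec)
    return post_mean, post_var**0.5


# MMM prior N(1.0, 1.0^2); DiD estimate 2.0 with SE 1.0.
post_mean, post_sd = conjugate_update(1.0, 1.0, 2.0, 1.0)
print(post_mean)  # 1.5
```

With equal precisions the posterior mean lands halfway between prior and estimate; a smaller standard error pulls it further toward the experiment.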

When the bridge isn't enough - small samples, multiple competing estimators, sequential monitoring, or borrowing strength across markets - go fully Bayesian on the experiment side too. Tools include BEST (robust Bayesian estimation), HDI + ROPE for decision-theoretic readouts, hierarchical / multilevel models, Bayesian synthetic control, and sequential Bayesian inference (no alpha-spending required). These all output posteriors that plug into Eq. (11) directly.

Implementation: A 90-Day Rollout

A staged rollout with explicit decision gates. Phases 1-3 form a 90-day pilot that delivers decision-grade evidence about whether to continue. Phase 4, the steady-state quarterly cadence, is what comes after.

Phase 1

Foundation

Weeks 1-5

Audit existing MMM & geo-lift artifacts
Build Bayesian MMM in parallel
Posterior predictive checks pass
Side-by-side OLS vs. Bayes readout

Phase 2

First priority cycle

Weeks 6-8

Compute first EIG/EVOI grid
Produce 2x2 priority map
Stakeholder review & selection
Pre-register top 2-3 experiments

Phase 3

Calibration loop

Weeks 9-13

Run pre-registered experiments
Refit MMM with calibration priors
Pre/post allocation comparison
Pilot go/no-go gate

Phase 4

Steady state

Quarterly · ongoing

Quarterly priority + calibration cycle
OLS pipeline retired after 2 cycles
Re-experimentation triggers fire
Confidence-tier reporting standard

Prerequisites

Geo-week panel data - 3+ years of the outcome metric (revenue, conversions, sessions) at the geo grain. Daily is fine; the model will aggregate.
Per-channel media spend at the same geo grain or with a documented allocation rule from national to geo.
Pre-period MAPE benchmarks from the existing MMM, by channel and aggregate. Needed to credibly compare the new pipeline.
Past geo-lift readouts with point estimates and standard errors. These are the calibration inputs for the first cycle. Verdicts ("CTV worked") are not enough - we need the underlying numbers.
Tooling: PyMC ≥ 5 (with NumPyro for fast NUTS) for the Bayesian MMM and hierarchical experiment models; ArviZ for diagnostics; pyfixest or fixest (R) for the frequentist DiD baseline; JAX optional but recommended for speed.

Success Metrics

Metric	Definition	Target	Cadence
Holdout MAPE	Out-of-sample predictive accuracy of the calibrated MMM	`≤ OLS baseline`	Each cycle
Posterior contraction	Average reduction in $\sigma_k$ for tested channels post-calibration	`≥ 30%`	Each cycle
Misallocation Δ	Weekly misallocation cost vs. last cycle	`Falling, then flat`	Each cycle
Portfolio mROI	Marginal return of the next dollar across the portfolio	`Rising over 4 cycles`	Annual review
Calibration coverage	Fraction of spend in experiment-backed channels (vs. model-only)	`≥ 60% by Year 1`	Each cycle
Stakeholder fluency	Sponsor and planning team can explain priority map & confidence tiers in their own words	`Yes by Cycle 3`	Cycle 3 review
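Two of the metrics in the table are direct computations on per-channel summaries. The sketch below follows the table's definitions; the function names, channel names, and numbers are hypothetical.

```python
from typing import Dict, Set


def posterior_contraction(pre_sd: Dict[str, float],
                          post_sd: Dict[str, float]) -> float:
    """Average fractional reduction in sigma_k across tested channels.

    post_sd should contain only the channels that were tested this cycle.
    """
    tested = [k for k in post_sd if k in pre_sd]
    if not tested:
        raise ValueError("no tested channels with a pre-calibration sd")
    return sum(1.0 - post_sd[k] / pre_sd[k] for k in tested) / len(tested)


def calibration_coverage(spend: Dict[str, float],
                         experiment_backed: Set[str]) -> float:
    """Fraction of total spend in experiment-backed channels."""
    total = sum(spend.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    backed = sum(v for k, v in spend.items() if k in experiment_backed)
    return backed / total


contraction = posterior_contraction(
    pre_sd={"ctv": 1.0, "search": 0.5, "audio": 0.8},
    post_sd={"ctv": 0.5, "search": 0.4},
)  # about 0.35, above the 30% target
coverage = calibration_coverage(
    spend={"ctv": 300.0, "search": 500.0, "audio": 200.0},
    experiment_backed={"ctv", "search"},
)  # 0.8, above the 60% target
```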

Anti-Patterns & Failure Modes

Six failure modes seen in similar programs. Calling them out up front so we recognize them when they happen.

"Run all the experiments we can think of." Submodularity says marginal information per experiment falls fast. Running 8 experiments to learn what 3 would have taught you is wasted budget. The priority engine is supposed to cap the program; treating it as an unconstrained idea-generator defeats the point.
"Calibrate every channel, every cycle." Calibration has a shelf life and a cost. Re-experimenting on a channel whose last test is 4 months old and stable is just lighting test budget on fire. Use the re-experimentation trigger; it exists for this reason.
"Use a flat prior so we're not biasing the answer." A flat prior reproduces the OLS pathology. Priors are how you regularize; refusing to use informative priors is a refusal to do the regularization work. Defensible > flat.
"The Bayesian MMM is wider than OLS, so it's worse." The Bayesian MMM is wider because the OLS interval was lying about its precision. Width is information, not weakness. Reporting templates need to make this distinction explicit.
"Let's just plug the experimental point estimate in as a fixed parameter." Hard calibration by point-substitution discards the experimental SE and produces over-confidence in the calibrated channel. Use the soft prior (M1) or augmented likelihood (M2). Never the point.
"The model said it; I just report what comes out." The Bayesian MMM "says" what your priors and likelihood structure say. The analyst owns those choices; reporting templates should attribute structural decisions to the analyst, not to "the model."

The most insidious failure mode

Stakeholders accept the new framework but quietly continue making allocation decisions on instinct, treating the priority map as a research artifact rather than a budget recommendation. This is invisible from the inside - the team produces the deliverables, gets compliments, and nothing gets used. Detection: at the second-cycle review, ask the sponsor to point at a specific budget reallocation that happened because of a calibrated posterior. If they can't, the program isn't actually running yet - only the appearance of it is.

Closing Principle

The MMM and geo-lift experiments are not competing measurement paradigms - they are complementary nodes in a Bayesian inference graph. The MMM provides a coherent joint model of all channels with full coverage; experiments provide local causal identification for high-priority channels. Information flows are bidirectional: the MMM shapes experiment design (via EIG/EVOI prioritization), and experiments calibrate the MMM (via informed priors or augmented likelihoods). Over successive cycles, this adaptive loop systematically contracts the uncertainty that actually matters for budget decisions.

Where to go next

To understand the math foundations in more depth, see the Bayesian Workflow and Causal Inference guides. To see how the MMM is implemented, see the Modeling Guide and the Technical Guide. To see what the framework produces, see the Demos & Reports.

References

Chaloner, K., & Verdinelli, I. (1995). Bayesian Experimental Design: A Review. Statistical Science, 10(3), 273-304.
Lindley, D. V. (1956). On a Measure of the Information Provided by an Experiment. Annals of Mathematical Statistics, 27(4), 986-1005.
Raiffa, H., & Schlaifer, R. (1961). Applied Statistical Decision Theory. Division of Research, Harvard Business School.
Nemhauser, G. L., Wolsey, L. A., & Fisher, M. L. (1978). An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14, 265-294.
Krause, A., & Golovin, D. (2014). Submodular Function Maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press.
Vaver, J., & Koehler, J. (2011). Measuring Ad Effectiveness Using Geo Experiments. Google Research.
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How Much Should We Trust Differences-in-Differences Estimates? Quarterly Journal of Economics, 119(1), 249-275.
Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.