Closed-Loop Measurement & Calibration

A comprehensive guide to integrating measurement into Marketing Mix Modeling. The Bayesian MMM and geo-lift experiments are not competing paradigms - they are complementary nodes in one inference graph. This page lays out how to wire them into a single self-correcting cycle that compounds learning over time.

Why this page exists

Most teams already run an MMM and a geo-lift program - and treat them as separate budgets, separate timelines, and separate answers. The closed-loop framework wires them together: the MMM chooses which experiments to run, and the experiments calibrate the next MMM. Each cycle tightens the parts of the picture that matter most for budget decisions.

The Two-Paradigm Problem

The MMM is the system of record for "what is each channel worth." It produces a clean ROI per channel, and the planning team allocates budget against those numbers. Separately, the team runs geo-lift experiments - paid search holdouts, CTV market tests, audio matched-market trials - each producing its own causal estimate. Sometimes the two agree; sometimes they don't. When they disagree, someone picks the source they trust more for that channel and moves on.

MMM System

"CTV ROI 1.4x"

Tight credible interval. Joint over all channels. Identified by observational variation - which means entangled when channels move together.

Experiment System

"CTV ROI 2.1x"

Wider interval. Single-channel scope. Causally identified by the randomization itself - so the estimate isn't entangled with the rest of the marketing plan.

The structural problem is not that either system is bad - both are well-built. The problem is that neither is wired into the other. The MMM doesn't know which channels need experimental backup; the experiments don't feed back into the next MMM fit. So the team keeps paying for both and keeps accepting whichever number the analyst trusted that week.

A tight number is not a true number

An MMM produces tight intervals when the model is well-regularized and the historical data tells a consistent story. But "consistent" is not the same as "correct" - when channels move in lockstep (national TV and digital flighted together; promo periods overlapping with seasonality), the model cannot tell them apart, and a regularizing prior produces a tight number anyway, anchored to an arbitrary point along an unidentified ridge. Confidently wrong is a real failure mode.

Two Complementary Objectives

Once you accept that experiments are the calibration tool, you still face the question of which experiments. Two distinct objectives emerge - they are correlated but not identical, and a good framework integrates both.

Epistemic - reduce uncertainty

  • Quantify posterior entropy per channel
  • Pick experiments that maximally collapse the posterior
  • Prioritize where beliefs are least grounded
  • Metric: Expected Information Gain (EIG)

Instrumental - improve decisions

  • Map ROI uncertainty to budget decision quality
  • Prioritize where errors are most costly
  • Quantify the dollar value of resolving each uncertainty
  • Metric: Expected Value of Information (EVOI)

A boutique channel may have huge posterior variance (high epistemic priority) but receive 1% of spend (low decision stakes). A workhorse channel may have moderate variance but receive 30% of spend, making any uncertainty extraordinarily costly. EIG and EVOI together produce a priority map that respects both.

Bayesian Foundations: Prior, Likelihood, Posterior

Bayesian inference is the discipline of updating beliefs in the face of new evidence. Three objects appear repeatedly in this guide:

$$p(\theta \mid \hat{y}) \;=\; \frac{p(\hat{y} \mid \theta)\, p(\theta)}{p(\hat{y})} \;\propto\; \underbrace{p(\hat{y} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}$$ (1)

For the rest of this guide, $\theta_k$ denotes a channel-level causal parameter (typically ROI or elasticity) for channel $k$, and $\hat{y}_k$ denotes the noisy estimate produced by an experiment. The MMM provides the prior $p(\theta_k)$. The experiment provides the likelihood $p(\hat{y}_k \mid \theta_k)$. Their product, once normalized, is the posterior $p(\theta_k \mid \hat{y}_k)$ that we want.

The Gaussian Conjugate Update

When the prior is Gaussian and the experiment likelihood is Gaussian (the standard assumption for geo-lift difference-in-differences), the math simplifies dramatically. With prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and likelihood $\hat{y} \mid \theta \sim \mathcal{N}(\theta, \sigma_e^2)$, the posterior is also Gaussian:

$$\theta \mid \hat{y} \;\sim\; \mathcal{N}\!\left(\mu_{\text{post}},\; \sigma_{\text{post}}^2\right)$$ $$\sigma_{\text{post}}^{-2} \;=\; \sigma_0^{-2} + \sigma_e^{-2}, \qquad \mu_{\text{post}} \;=\; \sigma_{\text{post}}^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{y}}{\sigma_e^2}\right)$$ (2)
Plain English

Precision (the inverse of variance) is additive. The posterior precision is the prior precision plus the experiment's precision. The posterior mean is a precision-weighted average of the prior mean and the experimental estimate - whichever is more precise pulls the posterior more strongly toward itself.

Equation (2) is the workhorse for almost everything that follows. EIG, calibration, and the bridge from frequentist estimates all reduce to applications of this single update.
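As a minimal sketch of Eq. (2), the snippet below applies the update to the CTV figures quoted earlier (MMM 1.4x, experiment 2.1x). The function name `gaussian_update` and both standard deviations (0.2 and 0.4) are illustrative assumptions; the text only says the MMM interval is tight and the experimental one wider.

```python
def gaussian_update(mu0, sigma0, y_hat, sigma_e):
    """Conjugate Gaussian update of Eq. (2); returns (posterior mean, posterior sd)."""
    prior_precision = 1.0 / sigma0**2
    exp_precision = 1.0 / sigma_e**2
    post_var = 1.0 / (prior_precision + exp_precision)  # precisions add
    post_mean = post_var * (mu0 * prior_precision + y_hat * exp_precision)
    return post_mean, post_var**0.5

# Point estimates come from the CTV example; both standard deviations are assumed.
mu_post, sd_post = gaussian_update(mu0=1.4, sigma0=0.2, y_hat=2.1, sigma_e=0.4)
print(round(mu_post, 2), round(sd_post, 3))
```

With these assumed widths the posterior mean sits closer to the MMM estimate than to the experiment, because the assumed prior precision is four times the experiment's.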

Decision Structure

Let $\theta_k$ denote the true causal ROI for channel $k \in \{1, \ldots, K\}$. The MMM produces a joint posterior $p(\boldsymbol{\theta}) = p(\theta_1, \ldots, \theta_K \mid y, X, \mathcal{M})$. The planner selects an experimental portfolio $A$: a set of channels drawn from $\{1, \ldots, K\}$, each paired with a design $d_k$:

$$A \;=\; \{(k_1, d_1),\, (k_2, d_2),\, \dots\}, \qquad d \;=\; \{n_{\text{test}},\, \Delta_{\text{spend}},\, \text{duration},\, \ldots\}$$ $$\text{subject to}\quad |A| \le C_{\text{ops}},\quad \mathrm{cost}(A) \le B$$ (3)

Each experiment produces a noisy estimate of the true causal effect. For a geo-lift difference-in-differences design, this is approximately Gaussian:

$$\hat{\tau}_k \mid \tau_k, d \;\sim\; \mathcal{N}\!\left(\tau_k,\; \sigma_{\text{exp},k}^2(d)\right), \qquad \sigma_{\text{exp},k}^2(d) \;\approx\; \frac{2\, s^2_{\text{geo},k}}{n\, T} \cdot \frac{1}{\Delta^2_{\text{spend},k}}$$ (4)

where $s^2_{\text{geo},k}$ is the residual geo-week variance from the pre-period, $n$ is geos per arm, and $T$ is the test duration in weeks. More geos, longer tests, and bigger spend deltas all shrink experimental noise.
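To make the scaling in Eq. (4) concrete, here is a small sketch. The function name and all numeric inputs are hypothetical; only the formula comes from the text.

```python
def experiment_variance(s2_geo, n_per_arm, weeks, delta_spend):
    """Approximate noise variance of a geo-lift DiD estimate, per Eq. (4)."""
    return 2.0 * s2_geo / (n_per_arm * weeks) / delta_spend**2

# Hypothetical design inputs, used only to show how each lever moves the noise.
base = experiment_variance(s2_geo=4.0, n_per_arm=10, weeks=8, delta_spend=0.5)
longer = experiment_variance(s2_geo=4.0, n_per_arm=10, weeks=16, delta_spend=0.5)
bolder = experiment_variance(s2_geo=4.0, n_per_arm=10, weeks=8, delta_spend=1.0)
print(base, longer, bolder)
```

Doubling the duration halves the variance, while doubling the spend delta cuts it to a quarter, which is why the spend delta is usually the most powerful design lever when it is operationally available.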

Expected Information Gain (EIG)

For a given channel $k$ and design $d$, the Expected Information Gain is the expected KL divergence from the prior to the updated posterior, integrated over what experimental outcome we might see. Equivalently, it is the mutual information between the unknown parameter and the experimental outcome:

$$\mathrm{EIG}(k, d) \;=\; \mathbb{E}_{\hat{y}_k}\!\left[\, \mathrm{KL}\!\left(\, p(\theta_k \mid \hat{y}_k, d) \,\middle\|\, p(\theta_k) \,\right) \right] \;=\; H[p(\theta_k)] \;-\; \mathbb{E}_{\hat{y}_k}\!\big[H[p(\theta_k \mid \hat{y}_k)]\big]$$ (5)

Under the Gaussian-Gaussian setup, EIG has a remarkably clean closed form. Because the posterior variance does not depend on the realized value of $\hat{y}$, the expected entropy reduction simplifies to:

$$\mathrm{EIG}(k, d) \;=\; \tfrac{1}{2}\log\!\left( \frac{\sigma_k^2}{\sigma_{\text{post},k}^2} \right) \;=\; \tfrac{1}{2}\log\!\left( 1 + \frac{\sigma_k^2}{\sigma_{\text{exp},k}^2(d)} \right)$$ (6)

Read this formula carefully

Equation (6) is the single most useful formula in this framework. Information gain depends only on a signal-to-noise ratio: the channel's prior variance divided by the experiment's noise variance. A noisy experiment on an uncertain channel can yield as much information as a precise experiment on a moderately uncertain one. Diminishing returns set in fast: with a base-2 logarithm, each quadrupling of the ratio adds less than one bit (about 0.66 bits from 1x to 4x, about 0.88 bits from 4x to 16x), so the ratio must grow geometrically for information to grow linearly.
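A short sketch of Eq. (6), evaluated in bits. The helper name `eig_bits` is an assumption for illustration; the ratios are arbitrary.

```python
import math

def eig_bits(prior_var, exp_var):
    """Closed-form Gaussian EIG of Eq. (6), in bits (base-2 logarithm)."""
    return 0.5 * math.log2(1.0 + prior_var / exp_var)

# Evaluate at a few variance ratios (prior variance / experiment noise variance).
for ratio in (1, 4, 16, 64):
    print(ratio, round(eig_bits(ratio, 1.0), 2))
```

Only the ratio matters: `eig_bits(4.0, 1.0)` and `eig_bits(0.04, 0.01)` return the same value.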

Practical inputs required

For non-Gaussian posteriors - hierarchical models, non-conjugate priors, skewed contributions - a nested Monte Carlo estimator is available at a cost of $O(N^2)$ likelihood evaluations. For prioritization purposes, the Gaussian approximation in Eq. (6) is almost always sufficient; reserve Monte Carlo refinement for borderline channels.
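One way a nested Monte Carlo estimator of Eq. (5) could look is sketched below. It is a generic textbook construction, not necessarily the estimator the authors use. The function names, sample sizes, and the Gaussian test case are all assumptions, and the Gaussian case is chosen only because Eq. (6) gives the exact answer to compare against.

```python
import math
import random

def gaussian_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def nested_mc_eig(sample_prior, simulate, loglik, n_outer, n_inner, rng):
    """Nested Monte Carlo EIG in nats; costs n_outer * n_inner likelihood calls."""
    total = 0.0
    for _ in range(n_outer):
        theta = sample_prior(rng)          # outer draw of the parameter
        y = simulate(theta, rng)           # simulated experimental outcome
        # Inner loop: estimate log p(y) by averaging the likelihood over fresh prior draws.
        inner = [loglik(y, sample_prior(rng)) for _ in range(n_inner)]
        peak = max(inner)                  # log-sum-exp for numerical stability
        log_marginal = peak + math.log(sum(math.exp(v - peak) for v in inner) / n_inner)
        total += loglik(y, theta) - log_marginal
    return total / n_outer

# Sanity check on a Gaussian case (all three values are assumed).
PRIOR_MEAN, PRIOR_SD, EXP_SD = 1.4, 0.5, 0.5
rng = random.Random(0)
estimate = nested_mc_eig(
    sample_prior=lambda r: r.gauss(PRIOR_MEAN, PRIOR_SD),
    simulate=lambda theta, r: r.gauss(theta, EXP_SD),
    loglik=lambda y, theta: gaussian_logpdf(y, theta, EXP_SD),
    n_outer=1000, n_inner=200, rng=rng,
)
closed_form = 0.5 * math.log(1 + PRIOR_SD**2 / EXP_SD**2)  # Eq. (6) in nats
print(round(estimate, 3), round(closed_form, 3))
```

For a real non-Gaussian MMM, `sample_prior` would draw from the fitted posterior and `simulate` / `loglik` would encode the planned estimator's sampling model; the inner sample size controls the estimator's bias.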

Expected Value of Information (EVOI)

EIG treats every bit of uncertainty reduction as equally valuable. But uncertainty about a channel receiving 1% of spend is far less costly than equivalent uncertainty about a channel receiving 40% of spend. EVOI prices uncertainty in dollars by accounting for the asymmetric cost of being wrong.

Define the downstream action as a budget allocation $\mathbf{b} = (b_1, \ldots, b_K)$ with $\sum_k b_k = B$. The organization picks $\mathbf{b}^*$ based on its current beliefs and earns utility $U(\mathbf{b}, \boldsymbol{\theta}) = \sum_k \theta_k \cdot f(b_k)$, where $f(\cdot)$ is a (usually concave) channel response function.

$$\mathrm{EVOI}(k) \;=\; \mathrm{EU}_{\text{after},k} \;-\; \mathrm{EU}_{\text{now}}$$ $$\mathrm{EU}_{\text{after},k} \;=\; \mathbb{E}_{\hat{y}_k}\!\left[ \max_{\mathbf{b}}\, \mathbb{E}_{\boldsymbol{\theta} \mid \hat{y}_k}\!\big[U(\mathbf{b}, \boldsymbol{\theta})\big] \right]$$ (7)

Closed-form EVOI is generally hard, but a useful linearized approximation drops out when reallocations only happen if the experiment changes the ordering of channels - i.e., if the posterior mean crosses a decision threshold:

$$\mathrm{EVOI}(k) \;\approx\; B \cdot s_k \cdot \mathbb{E}\!\left[\,\big|\hat{\theta}_{k,\text{post}} - \hat{\theta}_{k,\text{prior}}\big|\,\right] \cdot \mathbb{P}(\text{decision flips})$$ (8)
Plain English

EVOI is large when (i) the budget at stake, $B$, is big, (ii) the channel takes a large share $s_k$ of that budget, (iii) the experiment is likely to substantially shift the posterior mean, and (iv) there is a real chance the shift will cross a decision threshold (e.g., flip from "underspend" to "overspend").

Structural drivers

EVOI requires an explicit decision rule

EVOI is only as good as the assumed rule for translating posteriors into budgets. Make it explicit. A simple rule: reallocate proportional to posterior mean ROI, subject to floor/ceiling constraints. A richer rule: optimizer with saturation curves and channel-level minimums. Document whichever rule you use - the EVOI computation depends on it.
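To illustrate how an explicit decision rule turns Eq. (7) into a number, the sketch below uses a deliberately simple rule that is an assumption of this example, not the linearized Eq. (8): fund channel $k$ at a fixed budget if its expected ROI exceeds a hurdle rate, otherwise move that budget to an alternative that earns exactly the hurdle. Under the Gaussian update of Eq. (2), the posterior mean seen before the experiment runs is itself Gaussian with variance $\sigma_0^2 - \sigma_{\text{post}}^2$, which gives a closed form. All names and inputs are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def evoi_threshold(mu0, sigma0, sigma_e, hurdle, budget):
    """EVOI for a fund / don't-fund rule against a hurdle ROI (assumed rule)."""
    # Standard deviation of the posterior mean, viewed before the data arrive.
    s = sigma0**2 / math.sqrt(sigma0**2 + sigma_e**2)
    z = (mu0 - hurdle) / s
    value_after = (mu0 - hurdle) * normal_cdf(z) + s * normal_pdf(z)  # E[max(m - hurdle, 0)]
    value_now = max(mu0 - hurdle, 0.0)                                # decide on current beliefs
    return budget * (value_after - value_now)

# Hypothetical channel: prior ROI 1.4 +/- 0.2, experiment noise sd 0.4, $1M at stake.
near = evoi_threshold(mu0=1.4, sigma0=0.2, sigma_e=0.4, hurdle=1.5, budget=1_000_000)
far = evoi_threshold(mu0=1.4, sigma0=0.2, sigma_e=0.4, hurdle=3.0, budget=1_000_000)
print(round(near), round(far))
```

The same experiment is worth far more when the prior mean sits close to the hurdle, which is the "decision flips" factor of Eq. (8) in another guise. A different decision rule would give different numbers, which is why the rule has to be documented.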

The 2x2 Priority Map

Combine EIG and EVOI into a single priority score and a stakeholder-friendly visualization:

$$\mathrm{Score}(k) \;=\; \lambda \cdot \widetilde{\mathrm{EIG}}(k, d^*) \;+\; (1-\lambda) \cdot \widetilde{\mathrm{EVOI}}(k) \;-\; \gamma \cdot \mathrm{cost}(k, d^*) \;-\; \rho \cdot \mathrm{risk}(k)$$ (9)

where $\widetilde{\cdot}$ denotes min-max normalization to $[0, 1]$, $\lambda$ trades off epistemic vs. instrumental value, $\gamma$ penalizes expensive experiments, and $\rho$ penalizes operational risk.
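A minimal sketch of Eq. (9) follows. The channel names, input values, and weights are hypothetical, and cost and risk are assumed to be already expressed on a 0-1 scale so that $\gamma$ and $\rho$ are comparable to the normalized terms.

```python
def min_max(values):
    """Min-max normalize a dict of channel -> value onto [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span > 0 else 0.0 for k, v in values.items()}

def priority_scores(eig, evoi, cost, risk, lam=0.5, gamma=0.1, rho=0.1):
    """Score(k) from Eq. (9); the default weights are placeholders."""
    eig_n, evoi_n = min_max(eig), min_max(evoi)
    return {
        k: lam * eig_n[k] + (1 - lam) * evoi_n[k] - gamma * cost[k] - rho * risk[k]
        for k in eig
    }

# Hypothetical inputs: EIG in bits, EVOI in dollars, cost and risk on a 0-1 scale.
eig = {"ctv": 0.9, "search": 0.3, "audio": 1.2}
evoi = {"ctv": 400_000, "search": 150_000, "audio": 20_000}
cost = {"ctv": 0.5, "search": 0.2, "audio": 0.3}
risk = {"ctv": 0.2, "search": 0.1, "audio": 0.4}

scores = priority_scores(eig, evoi, cost, risk)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Here the audio channel has the highest EIG but the lowest EVOI, so it ranks behind CTV, which is strong on both axes.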

Priority grid (vertical axis: EIG, horizontal axis: EVOI)

Q2 - Run for learning (high EIG, low EVOI)

Uncertain, low-spend

Useful for model calibration and building geo-lift infrastructure, but lower urgency. Schedule when capacity allows.

Q1 - Highest priority (high EIG, high EVOI)

Uncertain, high-spend

Run these first. Both epistemic and instrumental value are maximized. The headline targets every cycle.

Q4 - Deprioritize (low EIG, low EVOI)

Well-known, low-spend

Skip experiments. Accept the MMM estimate. Revisit only if spend materially increases or the model changes.

Q3 - Monitor, don't test (low EIG, high EVOI)

Well-known, high-spend

The MMM is already well-identified here. Re-test only if model specification changes or spend levels shift dramatically.

Submodularity & Greedy Selection

A non-obvious but enormously useful fact: the EIG of a set of experiments is a submodular function of the set. Submodularity means diminishing returns - adding the same experiment to a small portfolio yields more information than adding it to a large one. Formally, for sets $A \subseteq A'$:

$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) \;\ge\; \mathrm{EIG}(A' \cup \{k\}) - \mathrm{EIG}(A')$$ (10)

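Submodularity has a major practical payoff: greedy selection is provably near-optimal. The Nemhauser-Wolsey-Fisher (1978) result says that for a monotone submodular objective, greedy attains at least $(1 - 1/e) \approx 63\%$ of the optimum under a cardinality constraint. EIG is monotone (an extra experiment never reduces expected information), so the bound applies to EIG itself; once the cost and risk penalties of Eq. (9) enter the score, treat greedy as a heuristic. The pseudocode is trivial: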

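# Greedy portfolio construction
# channels, C_ops, is_feasible(k, A) and score(k, A) are supplied by the caller.
def greedy_portfolio(channels, C_ops, is_feasible, score):
    A = []
    while len(A) < C_ops:
        # Candidates: not yet selected and feasible given the current portfolio
        feasible = [k for k in channels if k not in A and is_feasible(k, A)]
        if not feasible:
            break
        # score(k, A) is the marginal value of adding k to the current portfolio A
        k_star = max(feasible, key=lambda k: score(k, A))
        A.append(k_star)
    return A
# If score is the marginal EIG: EIG(A_greedy) >= (1 - 1/e) * EIG(A_optimal)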

Practical implication: after about three experiments per cycle, additional ones add diminishing returns. The cycle naturally caps the test program at a sensible level - so any one quarter's experimental budget should be spent on the top 2-3 priorities, not spread thin.

Proof of submodularity

Let $\theta$ be the vector of channel parameters (true ROIs), and let $Y_A = \{Y_k\}_{k \in A}$ denote the experimental outcomes for a set $A$ of channels. We write $\mathrm{EIG}(A) = I(\theta;\, Y_A)$ for the mutual information between the parameters and the outcomes of experiment set $A$. The claim is the diminishing returns property: for any $A \subseteq A'$ and any $k \notin A'$,

$$I(\theta;\, Y_k \mid Y_A) \;\ge\; I(\theta;\, Y_k \mid Y_{A'})$$ (10a)

Step 1 — marginal gain equals conditional MI. By the chain rule of mutual information, $I(\theta;\, Y_{A \cup \{k\}}) = I(\theta;\, Y_A) + I(\theta;\, Y_k \mid Y_A)$, so the marginal gain from adding $k$ to set $A$ is exactly the conditional mutual information:

$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) \;=\; I(\theta;\, Y_k \mid Y_A)$$ (10b)

Step 2 — the key inequality. Write $B = A' \setminus A$, so $Y_{A'} = (Y_A, Y_B)$. Apply the chain rule to $I(\theta,\, Y_k;\; Y_B \mid Y_A)$ in two ways:

$$I(\theta,\, Y_k;\; Y_B \mid Y_A) \;=\; \underbrace{I(\theta;\, Y_B \mid Y_A)}_{\text{term I}} + \underbrace{I(Y_k;\, Y_B \mid \theta,\, Y_A)}_{= \;0}$$ (10c)

The second term vanishes because experimental outcomes for distinct channels are conditionally independent given $\theta$: once we know the true parameters, the outcome of experiment $k$ carries no information about the outcomes of experiments $B$. The same joint quantity expands the other way as:

$$I(\theta,\, Y_k;\; Y_B \mid Y_A) \;=\; \underbrace{I(Y_k;\, Y_B \mid Y_A)}_{\ge\; 0} + I(\theta;\, Y_B \mid Y_A,\, Y_k)$$ (10d)

Equating (10c) and (10d) and dropping the non-negative $I(Y_k;\, Y_B \mid Y_A)$ term:

$$I(\theta;\, Y_B \mid Y_A) \;\ge\; I(\theta;\, Y_B \mid Y_A,\, Y_k)$$ (10e)

Step 3 — assemble the bound. Apply the chain rule once more to $I(\theta;\, Y_k,\, Y_B \mid Y_A)$:

$$I(\theta;\, Y_k \mid Y_A) - I(\theta;\, Y_k \mid Y_{A'}) \;=\; I(\theta;\, Y_B \mid Y_A) - I(\theta;\, Y_B \mid Y_A,\, Y_k) \;\stackrel{(10\mathrm{e})}{\ge}\; 0 \qquad \square$$ (10f)

Gaussian specialisation. In the conjugate Gaussian model, the marginal gain from adding channel $k$ to set $A$ takes the closed form: $$I(\theta_k;\, Y_k \mid Y_A) \;=\; \tfrac{1}{2}\log\!\left(1 + \frac{\sigma_{k|A}^2}{\sigma_{\text{exp},k}^2}\right)$$ where $\sigma_{k|A}^2$ is the posterior variance of $\theta_k$ after observing $Y_A$ and $\sigma_{\text{exp},k}^2$ is the experimental noise variance of Eq. (4). Because each experiment in $A$ can only reduce $\sigma_{k|A}^2$, the marginal gain is non-increasing as $A$ grows - the diminishing-returns property made explicit. Under the Nemhauser–Wolsey–Fisher (1978) theorem, greedy selection therefore attains at least $(1-1/e) \approx 63\%$ of the optimal portfolio value.

Experimental Design: From Channel Selection to Experimental Spec

Selecting which channel to test is a prioritization problem (EIG, EVOI, §3–4). Specifying how to run the test — which geos to treat, how much to spend, how long to run — is a distinct design problem. A geo-level Bayesian MMM provides the inputs needed to answer every design question rigorously, replacing heuristic defaults with posterior-derived quantities.

Quantities the MMM Provides

Quantity | Symbol | Use in test design
Per-geo channel posterior | $\beta_{k,g} \mid \mu_{k,g},\,\sigma_{k,g}^2$ | Identifies geos with most prior uncertainty (treated-geo ranking)
Per-geo current spend | $x_{k,g}$ | Anchors the saturation-curve evaluation point
Per-geo saturation curve | $S_k(\cdot;\,h_{k,g},\kappa_{k,g})$ | Local slope determines the optimal spend increment
Per-geo residual variance | $\sigma_g^2$ | Plugs directly into the standard error formula
Within-geo serial correlation | $\rho_g$ | Determines how duration converts to precision
Geo random-effect posteriors | $\mathbf{u}_g$ | Mahalanobis distance for matched-pair construction
Cross-geo posterior correlations | $\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})$ | Spillover and contamination diagnostics

Step 1 · Treated-Geo Selection by Information Yield

Not all geos contribute equally to resolving channel uncertainty. The information yield $I_{k,g}$ scores each geo by how much a test there would compress the posterior on $\beta_{k,g}$:

$$I_{k,g} \;=\; \frac{\sigma_{k,g}^2 \cdot \bigl[S_k'(x_{k,g})\bigr]^2}{\sigma_g^2}$$

The numerator is the product of prior uncertainty (wide posterior = more to learn) and saturation slope squared (steep curve = spend increment translates to large signal). The denominator is residual noise. A geo saturated for channel $k$ has a small $S_k'(x_{k,g})$ and therefore low yield regardless of how uncertain the posterior is. Rank geos descending by $I_{k,g}$ and select the top $G$.

Greedy selection applies here too. By the same submodularity argument as §4, greedy selection of geos by $I_{k,g}$ achieves at least $(1-1/e) \approx 63\%$ of the optimal joint information gain. Run the greedy pass once; the ranking is cheap to compute from the fitted MMM.
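The ranking in Step 1 can be sketched in a few lines. The geo labels and every input value below are hypothetical; only the formula for $I_{k,g}$ comes from the text.

```python
def information_yield(prior_var, slope, resid_var):
    """I_{k,g}: prior variance times squared saturation slope, over residual variance."""
    return prior_var * slope**2 / resid_var

# Hypothetical per-geo inputs: (sigma_{k,g}^2, S_k'(x_{k,g}), sigma_g^2)
geos = {
    "geo_a": (0.04, 0.82, 1.0),   # moderately uncertain, steep curve
    "geo_b": (0.04, 0.51, 1.0),   # same uncertainty, flatter curve
    "geo_c": (0.09, 0.10, 1.0),   # very uncertain but nearly saturated
    "geo_d": (0.02, 0.60, 2.0),   # well-known and noisy
}

ranked = sorted(geos, key=lambda g: information_yield(*geos[g]), reverse=True)
top_g = ranked[:2]   # select the top G = 2 geos
print(ranked, top_g)
```

Note that `geo_c` ranks last despite having the widest posterior: its small saturation slope means a spend increment there produces little signal.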

Step 2 · Matched-Pair Construction via Posterior Mahalanobis Distance

For each selected treatment geo $g$, choose a control geo $g'$ that minimises posterior Mahalanobis distance over the geo random effects $\mathbf{u}_g$:

$$d^2(g,\,g') \;=\; (\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})^\top \boldsymbol{\Sigma}_u^{-1}\,(\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})$$

where $\boldsymbol{\mu}_{u,g}$ is the posterior mean of geo $g$'s random-effect vector and $\boldsymbol{\Sigma}_u$ is the posterior covariance matrix of those effects. Matching on Mahalanobis distance over the full posterior — rather than on raw KPI correlation — accounts for parameter uncertainty and produces 20–40% lower residual variance in synthetic studies. Exclude any control candidate with cross-geo posterior correlation $|\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})| > 0.5$ (spillover screen; see Step 6).
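A minimal sketch of the matching step, restricted to a two-dimensional random-effect vector so the covariance inverse can be written out by hand. The geo labels, posterior means, covariance matrix, and correlations are all hypothetical.

```python
def mahalanobis_sq(mu_a, mu_b, cov):
    """Squared Mahalanobis distance for 2-dimensional random-effect means."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # Quadratic form with the explicit inverse of the 2x2 covariance matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def best_control(treated_mu, candidates, cov, max_abs_cor=0.5):
    """Closest eligible control; candidates map geo -> (posterior mean, correlation with treated geo)."""
    eligible = {g: mu for g, (mu, cor) in candidates.items() if abs(cor) <= max_abs_cor}
    return min(eligible, key=lambda g: mahalanobis_sq(treated_mu, eligible[g], cov))

# Hypothetical posterior summaries.
sigma_u = ((1.0, 0.3), (0.3, 0.5))
treated_mu = (0.2, 0.1)
candidates = {
    "geo_c1": ((0.30, 0.1), 0.10),
    "geo_c2": ((1.00, -0.5), 0.05),
    "geo_c3": ((0.21, 0.1), 0.70),   # closest match, but fails the spillover screen
}

control = best_control(treated_mu, candidates, sigma_u)
print(control)
```

With more than two random effects, a linear-algebra library would replace the hand-written inverse; the selection logic stays the same.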

Step 3 · Treatment Intensity from the Saturation Curve

The optimal spend increment for geo $g$ equates signal-to-noise across the panel:

$$\Delta\mathrm{spend}_g^* \;=\; \frac{\sigma_g}{S_k'(x_{k,g})} \cdot \sqrt{\frac{2\bigl(1+(T-1)\rho_g\bigr)}{GT} \cdot \frac{\eta}{1-\eta}} \cdot \frac{1}{\sigma_{k,g}}$$

where $\eta$ is target power, $G$ is the number of treatment geos, and $T$ is planned duration. The factor $S_k'(x_{k,g})^{-1}$ is the key innovation: a geo near saturation (small slope) requires a proportionally larger spend increment to produce the same signal as a sub-saturated geo. In practice $\Delta\mathrm{spend}_g^*$ varies 3–5× across geos for the same channel; applying a uniform percentage uplift systematically under-powers some geos and wastes budget in others.
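The spend-increment formula translates directly into code. The sketch below uses hypothetical inputs and an assumed function name; its only purpose is to show the inverse dependence on the saturation slope.

```python
import math

def spend_increment(sigma_g, slope, rho, n_geos, weeks, power, sigma_kg):
    """Delta-spend for one geo, following the Step 3 formula term by term."""
    design = 2.0 * (1.0 + (weeks - 1) * rho) / (n_geos * weeks)
    return sigma_g / slope * math.sqrt(design * power / (1.0 - power)) / sigma_kg

# Two hypothetical geos that differ only in their local saturation slope.
common = dict(sigma_g=1.0, rho=0.3, n_geos=2, weeks=8, power=0.8, sigma_kg=0.2)
steep = spend_increment(slope=0.82, **common)
flat = spend_increment(slope=0.51, **common)
print(round(flat / steep, 3))
```

When everything else is equal, the required increment scales as the inverse of the slope, so the flatter geo needs an increment larger by the ratio of the two slopes.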

Step 4 · Duration from Serial Correlation

Weekly KPI within a geo is serially correlated with AR(1) coefficient $\rho_g$, estimated from the MMM residuals. The corrected standard error for the difference-in-differences estimator is:

$$\mathrm{SE}(\hat\tau) \;=\; \sigma_y\,\sqrt{\tfrac{1}{n_t} + \tfrac{1}{n_c}} \cdot \sqrt{\frac{1+(T-1)\rho_g}{T}}$$

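When $\rho_g = 0$ the second factor reduces to the familiar $1/\sqrt{T}$ scaling; as $\rho_g \to 1$ it approaches 1, so additional weeks add almost no precision. With $\Delta y$ the KPI lift the test must detect, the minimum duration $T^*$ for target power $1-\beta$ solves: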

$$T^* \;=\; \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\;\sigma_y^2\;(n_t^{-1}+n_c^{-1}) \,\bigl(1+(T^*-1)\rho_g\bigr)}{\Delta y^2}$$

The implicit equation is solved by one-dimensional search. For empirically common values $\rho_g \in [0.2, 0.5]$, EIG flattens after 6–10 weeks — running longer accumulates cost with minimal additional information.
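Because the right-hand side of the duration equation is linear in $T^*$, it can also be rearranged in closed form instead of searched. Writing $c = (z_{1-\alpha/2} + z_{1-\beta})^2 \sigma_y^2 (n_t^{-1} + n_c^{-1}) / \Delta y^2$, the solution is $T^* = c(1-\rho_g)/(1 - c\,\rho_g)$, which exists only when $c\,\rho_g < 1$. The sketch below is one way to code this; the function name and inputs are hypothetical, and it relies on `statistics.NormalDist` from the Python standard library (3.8+).

```python
import math
from statistics import NormalDist

def min_duration(sigma_y, n_t, n_c, rho, delta_y, alpha=0.05, power=0.8):
    """Smallest whole number of weeks meeting the power target, or None if unreachable."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    c = z**2 * sigma_y**2 * (1 / n_t + 1 / n_c) / delta_y**2
    if c * rho >= 1:
        # Serial correlation puts a floor under the SE that no duration can beat.
        return None
    t_star = c * (1 - rho) / (1 - c * rho)
    return max(1, math.ceil(t_star))

# Hypothetical design: 10 geos per arm, residual sd 1.0, AR coefficient 0.3.
print(min_duration(sigma_y=1.0, n_t=10, n_c=10, rho=0.3, delta_y=0.8))
print(min_duration(sigma_y=1.0, n_t=10, n_c=10, rho=0.3, delta_y=0.6))
```

The second call returns `None`: for the smaller lift, the correlation floor on the standard error means the design needs more geos or a larger spend delta rather than more weeks.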

Delta-method power ceiling. The ROI posterior uncertainty $\sigma_{k,g}$ also enters through the signal: $\mathrm{Var}(\Delta y) \geq \Delta b^2\,\sigma_\theta^2$ (writing $\sigma_\theta$ for $\sigma_{k,g}$ and $\Delta b$ for the spend increment). This term is independent of $T$ and caps the achievable power: no duration is long enough to recover power lost to a wide ROI posterior. The design is feasible only if the power ceiling, $\lim_{T\to\infty} \mathrm{Power}(T)$, exceeds the target.

Step 5 · Pre-Experiment Simulation

Before committing budget, validate the proposed design by using the fitted MMM as a forward simulator:

  1. Sample $\tilde\theta$ from the posterior $p(\theta \mid \mathcal{D})$.
  2. Generate simulated KPI for each geo under the proposed spend schedule, adding noise $\mathcal{N}(0,\sigma_g^2)$ scaled by the serial-correlation factor $\sqrt{1+(T-1)\rho_g}$.
  3. Fit the planned estimator (DiD, SCM, etc.) to the simulated data and record whether it reaches significance at $\alpha$.
  4. Repeat 500–2,000 times; the fraction of significant replicates is the simulated power.

Simulated power catches issues — skewed KPI distributions, heterogeneous geo variance, adstock contamination — that closed-form approximations miss. If simulated power falls short, adjust $\Delta\mathrm{spend}_g^*$, add geos, or extend duration and re-simulate.
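The four simulation steps can be sketched as follows, under strong simplifications that are assumptions of this example: a Gaussian ROI posterior stands in for MMM draws, the response is linear (lift = ROI × spend delta, no saturation or adstock), noise is generated at the level of each geo's test-window mean using the Step 4 correction factor, and the "planned estimator" is a plain difference in arm means tested with a z-test. A production version would swap in the fitted MMM and the real estimator.

```python
import math
import random
from statistics import NormalDist, fmean

def simulated_power(roi_mean, roi_sd, delta_spend, sigma_g, rho, weeks,
                    n_treat, n_ctrl, alpha=0.05, reps=1000, seed=0):
    """Fraction of simulated experiments that reach significance at level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Sd of one geo's mean KPI over the test window (serial-correlation corrected).
    geo_sd = sigma_g * math.sqrt((1 + (weeks - 1) * rho) / weeks)
    se = geo_sd * math.sqrt(1 / n_treat + 1 / n_ctrl)
    hits = 0
    for _ in range(reps):
        theta = rng.gauss(roi_mean, roi_sd)            # step 1: posterior draw
        lift = theta * delta_spend                     # assumed linear response
        treated = [lift + rng.gauss(0, geo_sd) for _ in range(n_treat)]  # step 2
        control = [rng.gauss(0, geo_sd) for _ in range(n_ctrl)]
        estimate = fmean(treated) - fmean(control)     # step 3: planned estimator
        hits += abs(estimate / se) > z_crit
    return hits / reps                                 # step 4: simulated power

# Hypothetical design; all inputs are illustrative.
power = simulated_power(roi_mean=0.8, roi_sd=0.3, delta_spend=1.0, sigma_g=1.0,
                        rho=0.3, weeks=8, n_treat=10, n_ctrl=10)
print(power)
```

Setting `roi_mean=0` and `roi_sd=0` is a useful self-check: the rejection rate should then sit near `alpha`.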

Step 6 · Spillover and Contamination Diagnostics

Cross-geo spillover (national TV halo, retargeting bleed across DMAs) deflates the estimated lift by partially treating the control. Diagnose contamination with the cross-geo posterior correlation:

$$r_{g,g'} \;=\; \mathrm{Cor}\!\bigl(\beta_{k,g},\,\beta_{k,g'}\bigr)$$

Flag any control candidate with $|r_{g,g'}| > 0.5$ and exclude it from the donor pool. Adstock carryover introduces time-domain contamination in the first 1–2 weeks; include a wash-in buffer or model carryover explicitly in the geo-level analysis. Both diagnostics are computed from the fitted MMM posterior — no additional data collection is required.
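A small sketch of the spillover screen, assuming posterior draws of $\beta_{k,g}$ are available per geo as aligned lists (same draw index across geos). The draws below are made up for illustration, and a real posterior would have thousands of draws rather than five.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of posterior draws."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spillover_screen(treated_draws, candidate_draws, threshold=0.5):
    """Split control candidates into (kept, flagged) by |posterior correlation|."""
    kept, flagged = [], []
    for geo, draws in candidate_draws.items():
        r = pearson(treated_draws, draws)
        (flagged if abs(r) > threshold else kept).append(geo)
    return kept, flagged

# Made-up posterior draws of beta_{k,g}.
treated = [1.0, 1.2, 0.9, 1.4, 1.1]
candidates = {
    "geo_x": [2.0, 2.3, 1.9, 2.9, 2.1],   # moves with the treated geo
    "geo_y": [0.5, 0.5, 0.6, 0.6, 0.4],   # nearly unrelated
}
kept, flagged = spillover_screen(treated, candidates)
print(kept, flagged)
```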

Worked Example — CTV Campaign

Five candidate geos were scored for a Connected-TV channel. The table summarises the design output of Steps 1–6.

Geo | Role | $I_{k,g}$ rank | Saturation slope $S'$ | $\Delta\mathrm{spend}_g^*$ | $T^*$ (weeks)
Atlanta | Treatment | 1st | 0.82 (sub-saturated) | +18% | 7
Phoenix | Treatment | 2nd | 0.51 (mid-saturated) | +31% | 9
Tampa | Control (Atlanta), $d^2 = 0.9$ | - | - | 0% | -
Portland | Control (Phoenix), $d^2 = 1.4$ | - | - | 0% | -
St. Louis | Excluded, $r_{g,g'} = 0.71$ | 3rd | - | - | -

Atlanta's lower $\Delta\mathrm{spend}^*$ (18% vs 31%) reflects its steeper saturation slope: the same ROI uncertainty is resolved more cheaply where the response curve is still responsive. St. Louis was excluded despite ranking 3rd by $I_{k,g}$ because its cross-geo posterior correlation with Atlanta ($r = 0.71$) exceeded the spillover threshold.

Channel × geo generalisation. The same greedy submodular selection extends to portfolios of (channel, geo) pairs. Score every pair by $I_{k,g}$, apply the Mahalanobis exclusion and spillover screen, then select greedily up to the experiment budget. The $(1-1/e)$ approximation bound from §4 applies to the joint selection problem without modification.

Calibration: Using Experiments to Anchor the Next MMM

Calibration is the process of using experimental causal estimates to anchor the MMM's channel parameters so that model-implied ROIs are consistent with ground truth. This can range from soft regularization (Bayesian priors informed by experiments) to hard constraints (fixing parameters at experimental point estimates - which we generally avoid). Three mechanisms cover the main use cases.

Before calibration

  • Parameters identified by observational variation alone
  • Posteriors driven mostly by structural assumptions
  • Attribution may reflect media collinearity, not causality
  • ROI rankings unstable across model specifications

After calibration

  • Experimental likelihoods constrain key channel parameters
  • Posteriors tighter on tested channels; partial pooling propagates
  • Attribution anchored to causal reality for tested channels
  • Remaining uncertainty quantified, not hidden
M1 - Soft prior

Last quarter's posterior becomes this quarter's prior

The most principled approach: use the experimental posterior as the prior on the corresponding MMM channel parameter. During fitting, the MMM likelihood still drives the uncalibrated channels, while the informed prior anchors the calibrated channel near its experimentally identified value.

M2 - Likelihood augmentation

Treat the experiment as one more data point

Add the experimental estimate as an additional term in the joint likelihood. Cleaner when the experiment overlaps the MMM training window. Requires a mapping $g(\theta_k)$ from MMM parameters to the implied geo-lift effect size.

M3 - Hierarchical pooling

Tested channels pull untested siblings via shared structure

Information from a tested channel propagates to untested channels via a shared category-level hyperparameter. An experiment on CTV updates $\mu_{\text{video}}$, which partially informs the OLV estimate.

Mechanism 1 - Soft calibration via Bayesian priors

The experiment on channel $k$ produces $p(\tau_k \mid \hat{y}_k)$. Pass that distribution as the prior on the corresponding MMM parameter:

$$p(\tau_k \mid \hat{y}_k) \;\propto\; p(\hat{y}_k \mid \tau_k) \cdot p(\tau_k)$$ $$p(\theta_k) \;\leftarrow\; p(\tau_k \mid \hat{y}_k) \quad \text{(with appropriate scale transform)}$$ $$p(\boldsymbol{\theta} \mid y, X) \;\propto\; p(y \mid X, \boldsymbol{\theta}) \cdot p(\theta_k \mid \hat{y}_k) \cdot p(\boldsymbol{\theta}_{-k})$$ (11)

The scale transform matters. A geo-lift $\hat{\tau}$ is in outcome units per spend-dollar in the test window, while an MMM coefficient may be in standardized units. Convert carefully and document the conversion rule in the calibration spec.

Mechanism 2 - Likelihood augmentation

$$\log p(y, \hat{y}_k \mid X, \boldsymbol{\theta}) \;=\; \log p(y \mid X, \boldsymbol{\theta}) \;+\; \log p(\hat{y}_k \mid \theta_k)$$ $$p(\hat{y}_k \mid \theta_k) \;=\; \mathcal{N}\!\left(g(\theta_k),\; \sigma_{\text{exp},k}^2\right)$$ (12)

where $g(\theta_k)$ maps MMM channel parameters to the implied geo-lift effect size - accounting for the spend delta in the test, the saturation curve evaluated at the test spend level, and any adstock effects in the experiment window. Constructing $g(\cdot)$ correctly is the hardest part of this mechanism; miscalibration introduces systematic bias.

Mechanism 3 - Hierarchical pooling

$$\theta_k \mid \mu_{\text{cat}}, \sigma_{\text{cat}} \;\sim\; \mathcal{N}\!\left(\mu_{\text{cat}[k]},\; \sigma_{\text{cat}[k]}^2\right)$$ (13)

where $\mathrm{cat}[k]$ is the media category of channel $k$ (e.g., video, search, social). An experiment on CTV updates $\mu_{\text{video}}$, which partially informs the OLV estimate. Pooling strength is controlled by $\sigma_{\text{cat}}$, which gets its own hyperprior: the smaller $\sigma_{\text{cat}}$, the more strongly channels within a category are pooled.
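The propagation from a tested channel to an untested sibling can be worked out in closed form in a stripped-down version of Eq. (13). The sketch assumes a Gaussian hyperprior $\mu_{\text{video}} \sim \mathcal{N}(m_0, s_0^2)$, a known $\sigma_{\text{cat}}$, and ignores the MMM likelihood entirely; these are simplifications for illustration, and all numbers are hypothetical. Marginalizing over $\theta_{\text{CTV}}$, the experiment reads $\mu_{\text{video}}$ with noise variance $\sigma_{\text{cat}}^2 + \sigma_{\text{exp}}^2$.

```python
def pooled_update(m0, s0, sigma_cat, y_ctv, sigma_exp):
    """Update the category mean from one CTV experiment; return beliefs about an untested sibling."""
    # Marginally, y_ctv ~ N(mu_video, sigma_cat^2 + sigma_exp^2).
    obs_var = sigma_cat**2 + sigma_exp**2
    mu_post_var = 1.0 / (1.0 / s0**2 + 1.0 / obs_var)
    mu_post_mean = mu_post_var * (m0 / s0**2 + y_ctv / obs_var)
    # Untested sibling (e.g. OLV): theta_olv | y ~ N(mu_post_mean, mu_post_var + sigma_cat^2).
    return mu_post_mean, mu_post_var + sigma_cat**2

# Hypothetical numbers: category prior 1.0 +/- 0.5, CTV experiment reads 2.1 +/- 0.4.
loose_mean, loose_var = pooled_update(m0=1.0, s0=0.5, sigma_cat=0.3, y_ctv=2.1, sigma_exp=0.4)
tight_mean, tight_var = pooled_update(m0=1.0, s0=0.5, sigma_cat=0.1, y_ctv=2.1, sigma_exp=0.4)
print(round(loose_mean, 3), round(tight_mean, 3))
```

The untested sibling's expected ROI moves toward the CTV reading in both cases, and moves further when $\sigma_{\text{cat}}$ is smaller.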

Calibration validity checks

When experiment and MMM disagree

If the experimental estimate falls outside the MMM posterior's plausible range, do not automatically trust either side. Diagnose first: (1) Did the experiment have sufficient power? (2) Was there contamination or geo spillover? (3) Is the MMM response curve evaluated at the correct spend level? (4) Are there temporal confounders in the experiment window? Work through this checklist before resolving the conflict. The disagreement itself is documented evidence of model fragility - which is information you want.

The Virtuous Cycle: Six-Step Orbit

The full adaptive loop treats MMM calibration as a sequential Bayesian inference problem across measurement cycles. Each cycle has six steps - three modeling steps that happen against historical data, and three field steps that happen in the market.

Fit baseline MMM

Run the Bayesian MMM on historical data with weakly informative priors. Extract posterior $p(\boldsymbol{\theta})$ over all channel ROI parameters. Flag channels with $\sigma_k$ above threshold.

Score EIG & EVOI

Use the posterior from the baseline fit (T0) as the prior. Estimate $\sigma_{\text{exp}}$ per channel given geo footprint. Produce the priority grid and select the top-K portfolio subject to operational constraints.

Run experiments (pre-registered)

Execute the selected geo-lift / matched-market tests with pre-specified designs. Log all deviations. Apply ITT analysis by default. Produce $p(\tau_k \mid \hat{y}_k)$ per tested channel.

Calibrate the next MMM

Refit via informed priors (M1) or augmented likelihood (M2). Document the degree of belief revision: pre- vs post-calibration mean and width per channel.

Allocate from the calibrated posterior

Make budget decisions from the calibrated MMM. Tag each line as experiment-backed or model-only. Report confidence tiers - distinguish evidence quality from point estimates.

Re-score for next cycle

Calibrated channels now have tighter posteriors - their EIG and EVOI drop. Recompute the priority grid with updated beliefs. Previously deprioritized channels may now rise. Begin again.

The single-sentence summary

The MMM tells us where to look; the experiments tell us what's actually there; the next MMM bakes in what we learned. Each loop tightens the parts of the picture that matter most for budget decisions.

Information decay & re-experimentation triggers

Experimental calibration has a shelf life. An experiment from 18 months ago, run under different competitive conditions, spend levels, or creative strategies, may no longer accurately represent current channel effectiveness. Model an exponential decay of experimental information over time:

$$\sigma_{k,\text{eff}}^2(t) \;=\; \sigma_{k,\text{post}}^2 \cdot \exp(\lambda_{\text{decay}} \cdot t)$$ (14)

where $\lambda_{\text{decay}}$ is calibrated from observed year-over-year MMM coefficient stability. As $\sigma_{k,\text{eff}}^2$ grows, EIG and EVOI for re-experimentation rise again - creating a principled re-experimentation schedule rather than relying on calendar-based intuitions. Decay rates differ by channel: fast-moving digital decays in 6-12 months; stable broadcast can hold 18-24.
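Combining Eq. (14) with Eq. (6) gives a re-test date directly: re-experimentation becomes worthwhile once the inflated variance pushes EIG back over a chosen threshold. The sketch below solves for that time. The threshold, the monthly decay rate, and the variances are hypothetical, and time is measured in months as an assumption.

```python
import math

def effective_variance(post_var, decay_rate, months):
    """Eq. (14): calibrated variance inflated by exponential information decay."""
    return post_var * math.exp(decay_rate * months)

def eig_bits(prior_var, exp_var):
    """Eq. (6) in bits."""
    return 0.5 * math.log2(1.0 + prior_var / exp_var)

def months_until_retest(post_var, exp_var, decay_rate, eig_threshold_bits):
    """Time until a repeat experiment with noise exp_var would yield the threshold EIG."""
    # Invert Eq. (6): the variance at which EIG equals the threshold.
    target_var = exp_var * (2.0 ** (2.0 * eig_threshold_bits) - 1.0)
    if target_var <= post_var:
        return 0.0   # already worth re-testing
    return math.log(target_var / post_var) / decay_rate

# Hypothetical channel: calibrated variance 0.01, repeat-test noise 0.04, 10% decay per month.
months = months_until_retest(post_var=0.01, exp_var=0.04, decay_rate=0.1, eig_threshold_bits=0.5)
print(round(months, 1))
```

A faster-decaying channel (larger `decay_rate`) reaches the threshold proportionally sooner, which is the per-channel schedule described above.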

Adaptive stopping criteria

A channel exits the active experimentation pool when all three conditions hold:

$$\mathrm{EIG}(k, d^*) < \varepsilon_{\text{EIG}} \quad \text{AND} \quad \mathrm{EVOI}(k) < \varepsilon_{\text{EVOI}} \quad \text{AND} \quad \mathrm{age}(\text{last\_exp}_k) < T_{\text{refresh}}$$ (15)

$T_{\text{refresh}}$ depends on media market dynamics, creative/targeting shifts, and competitive changes. Typical values: 12-24 months for stable channels, 6-12 months for fast-moving digital surfaces.

Why It Compounds

The point of running this loop is not any single experiment - it is the compounding of belief updates over successive cycles. A useful way to see the value is to plot the trajectories of four quantities that each cycle is trying to move:

  1. MMM posterior precision - the 95% credible interval width on each channel's ROI. As experiments calibrate channels, widths contract.
  2. Budget allocation - the share of total spend going to each channel. As ROIs become better-identified, allocations migrate toward the (unknown but increasingly well-estimated) optimum.
  3. Misallocation cost - the gap between current spend and the spend that would maximize portfolio outcome under the current posterior. The running tab of wasted dollars per week.
  4. Decision efficiency - the fraction of the theoretically optimal portfolio return that is actually captured, averaged over the current posterior. Rises as sigma shrinks and allocation converges to the optimum.

For a representative five-channel portfolio over five quarterly cycles, the loop produces the kind of trajectory below. The exact magnitudes depend on starting MMM uncertainty and spend asymmetry, but the shape - sharp early gains followed by diminishing returns - is structural.

  • −92%: Weekly misallocation cost over 4 cycles (calibrated path vs. baseline)
  • +25pp: Decision efficiency gain over the same window (% of optimal return captured)
  • ~3: Experiments per cycle - submodularity caps useful additions
  • 0x: New measurement contracts required - same vendor stack, better routing

How to read the four panels

Posterior contraction is the epistemic outcome - what we know better. Allocation migration is the operational outcome - what we do differently as a result. Misallocation cost is the economic outcome - what poor knowledge was costing us. Decision efficiency is the strategic outcome - the share of the theoretically optimal return we are actually capturing. A defensible measurement program connects all four and reports them quarterly.

Frequentist Tools, Bayesian Framing

Most geo-lift estimators in active production use are frequentist by construction: difference-in-differences with two-way fixed effects, synthetic control, augmented synthetic control, and time-based regression. Stakeholders speak that language. Pre-registration documents demand it. None of this is in tension with the Bayesian framework above.

The mental model

Run frequentist estimators because they are well-understood, defensible, and pre-registrable. Interpret their outputs as Gaussian likelihoods feeding into the conjugate update of Eq. (2). Stakeholders see CIs and p-values; the MMM team sees a precision-weighted posterior. Both are correct views of the same numbers.

The Bridge Equation

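A frequentist point estimate $\hat{\beta}_k$ paired with a standard error $\mathrm{SE}_k$ is a Gaussian likelihood under mild assumptions: the estimator is approximately unbiased and normally distributed around the true effect $\theta_k$, and $\mathrm{SE}_k$ is treated as known. That likelihood plugs straight into the conjugate update alongside the MMM prior $\mathcal{N}(\mu_0, \sigma_0^2)$: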

$$\hat{\beta}_k \mid \theta_k \;\sim\; \mathcal{N}(\theta_k,\; \mathrm{SE}_k^2) \;\;\Longrightarrow\;\; \theta_k \mid \hat{\beta}_k \;\sim\; \mathcal{N}(\mu_{\text{post}},\, \sigma_{\text{post}}^2)$$ $$\sigma_{\text{post}}^{-2} \;=\; \sigma_0^{-2} + \mathrm{SE}_k^{-2}, \qquad \mu_{\text{post}} \;=\; \sigma_{\text{post}}^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{\beta}_k}{\mathrm{SE}_k^2}\right)$$ (16)

The translation table is short: estimator output $\hat{\beta}_k$ becomes the likelihood mean; $\mathrm{SE}_k^2$ becomes the likelihood variance $\sigma_e^2$; the MDE relates to the achievable $\sigma_{\text{exp}}$ through the design's power calculation. Every frequentist estimator the measurement team already runs - DiD, SCM, ASC, time-based regression, BSTS / CausalImpact - feeds the same loop without anyone switching teams.
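
As a worked instance of Eq. (16), the sketch below combines an MMM prior with a single experiment readout. The function name `bridge_update` is illustrative. The point estimates echo the CTV example from earlier on this page (1.4x vs. 2.1x), but the two standard deviations are assumptions picked for the arithmetic, not values from any real fit.

```python
import math


def bridge_update(mu0, sigma0, beta_hat, se):
    """Eq. (16): combine an MMM prior N(mu0, sigma0^2) with an
    experiment readout treated as a N(theta, se^2) likelihood."""
    post_precision = 1.0 / sigma0**2 + 1.0 / se**2
    post_var = 1.0 / post_precision
    post_mean = post_var * (mu0 / sigma0**2 + beta_hat / se**2)
    return post_mean, math.sqrt(post_var)


# Assumed sds: tight MMM prior (0.2), wider experiment SE (0.4).
mu_post, sigma_post = bridge_update(mu0=1.4, sigma0=0.2, beta_hat=2.1, se=0.4)
print(f"posterior ROI = {mu_post:.2f} +/- {sigma_post:.3f}")
```

With these inputs the posterior mean is about 1.54 and the posterior sd about 0.179. The mean sits between the two sources, closer to the tighter one, and the sd is smaller than either input. If the MMM's tight interval is an artifact of an unidentified ridge, the prior sd should be widened before this step, otherwise the experiment is underweighted.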

When the bridge isn't enough - small samples, multiple competing estimators, sequential monitoring, or borrowing strength across markets - go fully Bayesian on the experiment side too. Tools include BEST (robust Bayesian estimation), HDI + ROPE for decision-theoretic readouts, hierarchical / multilevel models, Bayesian synthetic control, and sequential Bayesian inference (no alpha-spending required). These all output posteriors that plug into Eq. (11) directly.
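
An HDI + ROPE readout is simple to sketch when the posterior is Gaussian, because the HDI of a normal distribution is its central interval. The decision labels, the 95% mass, and the ROPE of 0.9-1.1 around a break-even ROI of 1.0 are illustrative assumptions. A real program would set the ROPE from its own economics.

```python
from statistics import NormalDist


def hdi_rope_decision(post_mean, post_sd, rope, mass=0.95):
    """HDI + ROPE readout for a Gaussian posterior."""
    post = NormalDist(post_mean, post_sd)
    tail = (1.0 - mass) / 2.0
    # For a normal posterior the HDI equals the central interval.
    hdi = (post.inv_cdf(tail), post.inv_cdf(1.0 - tail))
    rope_lo, rope_hi = rope
    p_in_rope = post.cdf(rope_hi) - post.cdf(rope_lo)
    if hdi[0] > rope_hi or hdi[1] < rope_lo:
        decision = "outside ROPE"    # effect is credibly non-negligible
    elif hdi[0] >= rope_lo and hdi[1] <= rope_hi:
        decision = "inside ROPE"     # practically equivalent to break-even
    else:
        decision = "undecided"       # HDI and ROPE overlap
    return hdi, p_in_rope, decision


# Hypothetical posteriors on channel ROI.
for mean, sd in [(1.54, 0.18), (1.00, 0.30)]:
    hdi, p, decision = hdi_rope_decision(mean, sd, rope=(0.9, 1.1))
    print(f"mean={mean}, sd={sd}: HDI=({hdi[0]:.2f}, {hdi[1]:.2f}), "
          f"P(in ROPE)={p:.2f}, {decision}")
```

An "undecided" readout is a natural trigger for another experiment on that channel rather than a forced call.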

Implementation: A 90-Day Rollout

A staged rollout with explicit decision gates. Phases 1-3 together form the 90-day pilot (Weeks 1-13), which delivers decision-grade evidence about whether to continue. Phase 4, the steady-state quarterly cadence, is what comes after.

Phase 1: Foundation (Weeks 1-6)
  • Audit existing MMM & geo-lift artifacts
  • Build Bayesian MMM in parallel
  • Posterior predictive checks pass
  • Side-by-side OLS vs. Bayes readout
Phase 2: First priority cycle (Weeks 6-8)
  • Compute first EIG/EVOI grid
  • Produce 2x2 priority map
  • Stakeholder review & selection
  • Pre-register top 2-3 experiments
Phase 3: Calibration loop (Weeks 9-13)
  • Run pre-registered experiments
  • Refit MMM with calibration priors
  • Pre/post allocation comparison
  • Pilot go/no-go gate
Phase 4: Steady state (Quarterly, ongoing)
  • Quarterly priority + calibration cycle
  • OLS pipeline retired after 2 cycles
  • Re-experimentation triggers fire
  • Confidence-tier reporting standard

Prerequisites

Success Metrics

| Metric | Definition | Target | Cadence |
| --- | --- | --- | --- |
| Holdout MAPE | Out-of-sample predictive accuracy of the calibrated MMM | ≤ OLS baseline | Each cycle |
| Posterior contraction | Average reduction in $\sigma_k$ for tested channels post-calibration | ≥ 30% | Each cycle |
| Misallocation Δ | Weekly misallocation cost vs. last cycle | Falling, then flat | Each cycle |
| Portfolio mROI | Marginal return of the next dollar across the portfolio | Rising over 4 cycles | Annual review |
| Calibration coverage | Fraction of spend in experiment-backed channels (vs. model-only) | ≥ 60% by Year 1 | Each cycle |
| Stakeholder fluency | Sponsor and planning team can explain priority map & confidence tiers in their own words | Yes by Cycle 3 | Cycle 3 review |
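
Two of these metrics reduce to a few lines of arithmetic, sketched below. The definitions are one reading of the metric descriptions: contraction is taken as the unweighted mean of per-channel fractional reductions in $\sigma_k$ (a spend-weighted mean is an equally plausible choice), and coverage as spend share. Channel names and all numbers are hypothetical.

```python
def posterior_contraction(sigma_before, sigma_after):
    """Mean fractional reduction in sigma_k across tested channels."""
    reductions = [
        1.0 - sigma_after[ch] / sigma_before[ch] for ch in sigma_after
    ]
    return sum(reductions) / len(reductions)


def calibration_coverage(spend, experiment_backed):
    """Share of total spend in channels with an experiment-backed posterior."""
    total = sum(spend.values())
    backed = sum(v for ch, v in spend.items() if ch in experiment_backed)
    return backed / total


# Hypothetical inputs: posterior sds before/after calibration, weekly spend.
sigma_before = {"ctv": 0.50, "paid_search": 0.30}
sigma_after = {"ctv": 0.30, "paid_search": 0.24}
spend = {"ctv": 400.0, "paid_search": 300.0, "audio": 300.0}

contraction = posterior_contraction(sigma_before, sigma_after)
coverage = calibration_coverage(spend, experiment_backed={"ctv", "paid_search"})
print(f"posterior contraction: {contraction:.0%} (target >= 30%)")
print(f"calibration coverage:  {coverage:.0%} (target >= 60% by Year 1)")
```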

Anti-Patterns & Failure Modes

Six failure modes recur in similar programs. We call them out up front so that we recognize them when they happen.

The most insidious failure mode

Stakeholders accept the new framework but quietly continue making allocation decisions on instinct, treating the priority map as a research artifact rather than a budget recommendation. This is invisible from the inside - the team produces the deliverables, gets compliments, and nothing gets used. Detection: at the second-cycle review, ask the sponsor to point at a specific budget reallocation that happened because of a calibrated posterior. If they can't, the program isn't actually running yet - only the appearance of it is.

Closing Principle

The MMM and geo-lift experiments are not competing measurement paradigms - they are complementary nodes in a Bayesian inference graph. The MMM provides a coherent joint model of all channels with full coverage; experiments provide local causal identification for high-priority channels. Information flows are bidirectional: the MMM shapes experiment design (via EIG/EVOI prioritization), and experiments calibrate the MMM (via informed priors or augmented likelihoods). Over successive cycles, this adaptive loop systematically contracts the uncertainty that actually matters for budget decisions.

Where to go next

To understand the math foundations in more depth, see the Bayesian Workflow and Causal Inference guides. To see how the MMM is implemented, see the Modeling Guide and the Technical Guide. To see what the framework produces, see the Demos & Reports.