Closed-Loop Measurement & Calibration
A comprehensive guide to integrating measurement into Marketing Mix Modeling. The Bayesian MMM and geo-lift experiments are not competing paradigms - they are complementary nodes in one inference graph. This page lays out how to wire them into a single self-correcting cycle that compounds learning over time.
Why this page exists
Most teams already run an MMM and a geo-lift program - and treat them as separate budgets, separate timelines, and separate answers. The closed-loop framework wires them together: the MMM chooses which experiments to run, and the experiments calibrate the next MMM. Each cycle tightens the parts of the picture that matter most for budget decisions.
The Two-Paradigm Problem
The MMM is the system of record for "what is each channel worth." It produces a clean ROI per channel, and the planning team allocates budget against those numbers. Separately, the team runs geo-lift experiments - paid search holdouts, CTV market tests, audio matched-market trials - each producing its own causal estimate. Sometimes the two agree; sometimes they don't. When they disagree, someone picks the source they trust more for that channel and moves on.
MMM System
"CTV ROI 1.4x"
Tight credible interval. Joint over all channels. Identified by observational variation - which means entangled when channels move together.
Experiment System
"CTV ROI 2.1x"
Wider interval. Single-channel scope. Causally identified by the randomization itself - so the estimate isn't entangled with the rest of the marketing plan.
The structural problem is not that either system is bad - both are well-built. The problem is that neither is wired into the other. The MMM doesn't know which channels need experimental backup; the experiments don't feed back into the next MMM fit. So the team keeps paying for both and keeps accepting whichever number the analyst trusted that week.
A tight number is not a true number
An MMM produces tight intervals when the model is well-regularized and the historical data tells a consistent story. But "consistent" is not the same as "correct" - when channels move in lockstep (national TV and digital flighted together; promo periods overlapping with seasonality), the model cannot tell them apart, and a regularizing prior produces a tight number anyway, anchored to an arbitrary point along an unidentified ridge. Confidently wrong is a real failure mode.
Two Complementary Objectives
Once you accept that experiments are the calibration tool, you still face the question of which experiments. Two distinct objectives emerge - they are correlated but not identical, and a good framework integrates both.
Epistemic - reduce uncertainty
- Quantify posterior entropy per channel
- Pick experiments that maximally collapse the posterior
- Prioritize where beliefs are least grounded
- Metric: Expected Information Gain (EIG)
Instrumental - improve decisions
- Map ROI uncertainty to budget decision quality
- Prioritize where errors are most costly
- Quantify the dollar value of resolving each uncertainty
- Metric: Expected Value of Information (EVOI)
A boutique channel may have huge posterior variance (high epistemic priority) but receive 1% of spend (low decision stakes). A workhorse channel may have moderate variance but receive 30% of spend, making any uncertainty extraordinarily costly. EIG and EVOI together produce a priority map that respects both.
Bayesian Foundations: Prior, Likelihood, Posterior
Bayesian inference is the discipline of updating beliefs in the face of new evidence. Three objects appear repeatedly in this guide:
- Prior $p(\theta)$ - what you believe about an unknown parameter $\theta$ before seeing experimental data. For us, this is what the MMM tells us about a channel's ROI.
- Likelihood $p(\hat{y} \mid \theta)$ - the probability of observing the experimental outcome $\hat{y}$ if the truth were $\theta$.
- Posterior $p(\theta \mid \hat{y})$ - the updated belief about $\theta$ after seeing $\hat{y}$.
For the rest of this guide, $\theta_k$ denotes a channel-level causal parameter (typically ROI or elasticity) for channel $k$, and $\hat{y}_k$ denotes the noisy estimate produced by an experiment. The MMM provides $p(\theta_k)$. The experiment provides $p(\hat{y}_k \mid \theta_k)$. The product is what we want.
The Gaussian Conjugate Update
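When the prior is Gaussian and the experiment likelihood is Gaussian (the standard assumption for geo-lift difference-in-differences), the math simplifies dramatically. With prior $\theta \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and likelihood $\hat{y} \mid \theta \sim \mathcal{N}(\theta, \sigma_e^2)$, the posterior is also Gaussian:
$$\theta \mid \hat{y} \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad \frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_e^2}, \qquad \mu_1 = \sigma_1^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{y}}{\sigma_e^2}\right) \tag{2}$$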
Precision (the inverse of variance) is additive. The posterior precision is the prior precision plus the experiment's precision. The posterior mean is a precision-weighted average of the prior mean and the experimental estimate - whichever is more precise pulls the posterior more strongly toward itself.
Equation (2) is the workhorse for almost everything that follows. EIG, calibration, and the bridge from frequentist estimates all reduce to applications of this single update.
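A minimal sketch of the precision-weighted update described above. The point estimates (1.4 and 2.1) echo the CTV example from earlier on this page; the two standard deviations are hypothetical values chosen only for illustration.

```python
from typing import Tuple


def gaussian_update(mu0: float, sigma0: float, y_hat: float, sigma_e: float) -> Tuple[float, float]:
    """Posterior mean and SD for a Gaussian prior combined with a Gaussian likelihood."""
    prior_precision = 1.0 / sigma0 ** 2
    exp_precision = 1.0 / sigma_e ** 2
    post_precision = prior_precision + exp_precision  # precisions add
    post_mean = (prior_precision * mu0 + exp_precision * y_hat) / post_precision
    return post_mean, post_precision ** -0.5


# MMM prior: ROI 1.4 with SD 0.2 (assumed). Geo-lift: ROI 2.1 with SE 0.4 (assumed).
post_mean, post_sd = gaussian_update(mu0=1.4, sigma0=0.2, y_hat=2.1, sigma_e=0.4)
print(f"posterior ROI: {post_mean:.2f} +/- {post_sd:.3f}")  # posterior ROI: 1.54 +/- 0.179
```

Because the MMM prior is four times as precise as the experiment in this illustration, the posterior mean moves only one fifth of the way from 1.4 toward 2.1.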
Decision Structure
Let $\theta_k$ denote the true causal ROI for channel $k \in \{1, \ldots, K\}$. The MMM produces a joint posterior $p(\boldsymbol{\theta}) = p(\theta_1, \ldots, \theta_K \mid y, X, \mathcal{M})$. The planner selects an experimental portfolio $A \subseteq \{1, \ldots, K\}$ together with a design $d_k$ for each chosen channel:
Each experiment produces a noisy estimate of the true causal effect. For a geo-lift difference-in-differences design, this is approximately Gaussian:
where $s^2_{\text{geo},k}$ is the residual geo-week variance from the pre-period, $n$ is geos per arm, and $T$ is the test duration in weeks. More geos, longer tests, and bigger spend deltas all shrink experimental noise.
Expected Information Gain (EIG)
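For a given channel $k$ and design $d$, the Expected Information Gain is the expected KL divergence from the prior to the updated posterior, integrated over what experimental outcome we might see. Equivalently, it is the mutual information between the unknown parameter and the experimental outcome:
$$\mathrm{EIG}_k(d) = \mathbb{E}_{\hat{y}_k}\!\left[ D_{\mathrm{KL}}\!\big( p(\theta_k \mid \hat{y}_k, d) \,\|\, p(\theta_k) \big) \right] = I(\theta_k;\, \hat{y}_k \mid d)$$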
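Under the Gaussian-Gaussian setup, EIG has a remarkably clean closed form. Because the posterior variance does not depend on the realized value of $\hat{y}$, the expected entropy reduction simplifies to (in bits):
$$\mathrm{EIG}_k(d) = \tfrac{1}{2}\log_2\!\left(1 + \frac{\sigma_k^2}{\sigma_{\text{exp},k}^2(d)}\right) \tag{6}$$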
Read this formula carefully
Equation (6) is the single most useful formula in this framework. Information gain depends only on a signal-to-noise ratio: the channel's prior variance divided by the experiment's noise variance. A noisy experiment on an uncertain channel can yield as much information as a precise experiment on a moderately uncertain one. Diminishing returns set in fast because the gain is logarithmic in that ratio: a 1x ratio yields 0.5 bits, 4x yields about 1.2 bits, and 16x yields about 2.0 bits, so each further quadrupling of the ratio buys less than one additional bit.
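To make the signal-to-noise reading concrete, here is a small sketch of the Gaussian closed form, one half of the base-2 log of one plus the variance ratio. The function name `eig_bits` is ours, not part of any library.

```python
import math


def eig_bits(sigma_prior: float, sigma_exp: float) -> float:
    """Gaussian-Gaussian expected information gain, in bits."""
    variance_ratio = (sigma_prior / sigma_exp) ** 2
    return 0.5 * math.log2(1.0 + variance_ratio)


# Sweep the prior-to-noise variance ratio with the noise SD fixed at 1.
for ratio in (1, 4, 16, 64):
    bits = eig_bits(math.sqrt(ratio), 1.0)
    print(f"variance ratio {ratio:>2}x -> {bits:.2f} bits")
```

The sweep gives roughly 0.50, 1.16, 2.04, and 3.01 bits: a 64-fold improvement in the ratio buys about 2.5 bits in total.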
Practical inputs required
- Extract MMM marginals - for each channel $k$, compute $\sigma_k$ from the posterior samples of its ROI coefficient. Use the marginal standard deviation, not the standard error of the posterior mean.
- Estimate experimental noise per design - estimate $\sigma_{\text{exp},k}^2(d)$ from a power analysis fit on the pre-period geo-week panel. This links design choices to achievable precision.
- Compute the EIG grid - sweep across channels x designs, apply Eq. (6), and produce a ranked grid of $(k, d)$ pairs.
- Constrain & select - apply geo non-overlap, budget caps, minimum duration, and non-interference. Solve the constrained selection problem; greedy is usually sufficient (see Submodularity).
For non-Gaussian posteriors - hierarchical models, non-conjugate priors, skewed contributions - a nested Monte Carlo estimator works in $O(N^2)$ samples. For prioritization purposes, the Gaussian approximation in Eq. (6) is almost always sufficient. Monte Carlo refinement is reserved for borderline channels.
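The four practical inputs above can be sketched as a small grid computation. Everything numeric here is hypothetical: the posterior draws stand in for real MMM samples, and `sigma_exp` uses a placeholder noise model (residual SD divided by the square root of geos times weeks) rather than a fitted power analysis.

```python
import math
import random
import statistics

rng = random.Random(0)
# Hypothetical posterior ROI draws per channel (stand-ins for real MMM samples).
posterior_samples = {
    "ctv": [rng.gauss(1.4, 0.60) for _ in range(4000)],
    "paid_search": [rng.gauss(2.0, 0.15) for _ in range(4000)],
    "audio": [rng.gauss(0.9, 0.35) for _ in range(4000)],
}
# Hypothetical designs: (geos per arm, weeks).
designs = {"small": (5, 4), "medium": (10, 6), "large": (20, 8)}
S_GEO = 3.0  # assumed residual geo-week SD, in ROI units


def sigma_exp(n_geos: int, weeks: int) -> float:
    """Placeholder noise model: more geos and more weeks shrink the noise."""
    return S_GEO / math.sqrt(n_geos * weeks)


def eig_bits(sigma_prior: float, sigma_noise: float) -> float:
    return 0.5 * math.log2(1.0 + (sigma_prior / sigma_noise) ** 2)


grid = []
for channel, draws in posterior_samples.items():
    sigma_k = statistics.stdev(draws)  # marginal posterior SD, not the SE of the mean
    for design_name, (n_geos, weeks) in designs.items():
        grid.append((channel, design_name, eig_bits(sigma_k, sigma_exp(n_geos, weeks))))

# Rank (channel, design) pairs by EIG, highest first.
grid.sort(key=lambda row: row[2], reverse=True)
for channel, design_name, bits in grid[:5]:
    print(f"{channel:12s} {design_name:7s} {bits:.2f} bits")
```

The ranked grid is only the input to selection; geo non-overlap, budget caps, and the other constraints still have to be applied before choosing a portfolio.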
Expected Value of Information (EVOI)
EIG treats every bit of uncertainty reduction as equally valuable. But uncertainty about a channel receiving 1% of spend is far less costly than equivalent uncertainty about a channel receiving 40% of spend. EVOI prices uncertainty in dollars by accounting for the asymmetric cost of being wrong.
Define the downstream action as a budget allocation $\mathbf{b} = (b_1, \ldots, b_K)$ with $\sum_k b_k = B$. The organization picks $\mathbf{b}^*$ based on its current beliefs and earns utility $U(\mathbf{b}, \boldsymbol{\theta}) = \sum_k \theta_k \cdot f(b_k)$, where $f(\cdot)$ is a (usually concave) channel response function.
Closed-form EVOI is generally hard, but a useful linearized approximation drops out when reallocations only happen if the experiment changes the ordering of channels - i.e., if the posterior mean crosses a decision threshold:
EVOI is large when (i) the budget at stake is big, (ii) the channel takes a lot of that budget, (iii) the experiment is likely to substantially shift the posterior mean, and (iv) there is a real chance the shift will cross a decision threshold (e.g., flip from "underspend" to "overspend").
Structural drivers
- High prior uncertainty ($\sigma_k$ large) - wide credible intervals mean large potential for belief revision, increasing EVOI.
- High spend share ($s_k$ large) - channels receiving big budget fractions have higher stakes; errors cost more.
- Decision threshold proximity - when the posterior mean is near a budget reallocation threshold (e.g., borderline ROI = 1.0), even small uncertainty reductions have high value.
- Concave response (saturation) - when the response function is concave, correcting misallocation has compounding returns.
EVOI requires an explicit decision rule
EVOI is only as good as the assumed rule for translating posteriors into budgets. Make it explicit. A simple rule: reallocate proportional to posterior mean ROI, subject to floor/ceiling constraints. A richer rule: optimizer with saturation curves and channel-level minimums. Document whichever rule you use - the EVOI computation depends on it.
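To show how an explicit decision rule turns into a dollar figure, here is a sketch for a deliberately simplified rule that is our assumption, not the allocation rule of any particular planning team: fund a budget line only if the posterior mean ROI exceeds a threshold, with a linear payoff of budget times (ROI minus threshold). Under the Gaussian update, the posterior mean is itself Gaussian before the experiment is run, which gives a closed form.

```python
import math


def _pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def evoi_threshold(mu0: float, sigma0: float, sigma_e: float,
                   budget: float, threshold: float = 1.0) -> float:
    """Expected value of information for a fund / do-not-fund decision.

    Assumes payoff = budget * (ROI - threshold) if funded, 0 otherwise.
    """
    # SD of the posterior mean across possible experiment outcomes.
    s = sigma0 ** 2 / math.sqrt(sigma0 ** 2 + sigma_e ** 2)
    z = (mu0 - threshold) / s
    value_with_info = budget * ((mu0 - threshold) * _cdf(z) + s * _pdf(z))
    value_without_info = budget * max(mu0 - threshold, 0.0)
    return value_with_info - value_without_info


# Same uncertainty and budget; only the distance to the threshold differs.
near = evoi_threshold(mu0=1.05, sigma0=0.3, sigma_e=0.3, budget=1_000_000)
far = evoi_threshold(mu0=2.00, sigma0=0.3, sigma_e=0.3, budget=1_000_000)
print(f"EVOI near threshold: ${near:,.0f}; far from threshold: ${far:,.0f}")
```

The channel sitting near the threshold has a far larger EVOI than the one whose ROI is comfortably above it, even though both have the same posterior width. A richer rule (saturation curves, channel minimums) would generally need Monte Carlo rather than a closed form.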
The 2x2 Priority Map
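Combine EIG and EVOI into a single priority score and a stakeholder-friendly visualization:
$$\mathrm{Priority}_k = \lambda\,\widetilde{\mathrm{EIG}}_k + (1-\lambda)\,\widetilde{\mathrm{EVOI}}_k - \gamma\,\widetilde{\mathrm{Cost}}_k - \rho\,\widetilde{\mathrm{Risk}}_k$$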
where $\widetilde{\cdot}$ denotes min-max normalization to $[0, 1]$, $\lambda$ trades off epistemic vs. instrumental value, $\gamma$ penalizes expensive experiments, and $\rho$ penalizes operational risk.
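A sketch of one plausible reading of the score: a convex combination of normalized EIG and EVOI, minus normalized cost and risk penalties. The exact functional form, the default weights, and all input numbers are assumptions for illustration.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_scores(eig, evoi, cost, risk, lam=0.5, gamma=0.2, rho=0.1) -> List[float]:
    n_eig, n_evoi, n_cost, n_risk = (min_max(x) for x in (eig, evoi, cost, risk))
    return [lam * e + (1.0 - lam) * v - gamma * c - rho * r
            for e, v, c, r in zip(n_eig, n_evoi, n_cost, n_risk)]


# Hypothetical inputs for four channels.
channels = ["ctv", "paid_search", "audio", "olv"]
eig = [1.4, 0.3, 0.9, 0.7]                    # bits
evoi = [80_000, 120_000, 5_000, 30_000]       # dollars
cost = [50_000, 40_000, 10_000, 20_000]       # experiment cost, dollars
risk = [0.3, 0.1, 0.2, 0.2]                   # operational risk score

scores = priority_scores(eig, evoi, cost, risk)
ranked = sorted(zip(channels, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:12s} {score:.3f}")
```

Min-max normalization makes the score relative to the current candidate set, so adding or removing a channel can change the scores of the others; compare scores only within one cycle's grid.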
Quadrants of the priority map (horizontal axis: EVOI, increasing left to right; vertical axis: EIG):
Uncertain, low-spend
Useful for model calibration and building geo-lift infrastructure, but lower urgency. Schedule when capacity allows.
Uncertain, high-spend
Run these first. Both epistemic and instrumental value are maximized. The headline targets every cycle.
Well-known, low-spend
Skip experiments. Accept the MMM estimate. Revisit only if spend materially increases or the model changes.
Well-known, high-spend
The MMM is already well-identified here. Re-test only if model specification changes or spend levels shift dramatically.
Submodularity & Greedy Selection
A non-obvious but enormously useful fact: the EIG of a set of experiments is a submodular function of the set. Submodularity means diminishing returns - adding the same experiment to a small portfolio yields more information than adding it to a large one. Formally, for sets $A \subseteq A'$ and any channel $k \notin A'$:
$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) \;\ge\; \mathrm{EIG}(A' \cup \{k\}) - \mathrm{EIG}(A')$$
Submodularity has a major practical payoff: greedy selection is provably near-optimal. The Nemhauser-Wolsey-Fisher (1978) result says greedy attains at least $(1 - 1/e) \approx 63\%$ of the optimum under cardinality constraints. The pseudocode is trivial:
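```python
# Greedy portfolio construction.
# channels, C_ops, is_feasible(k, A), and score(k, A) are supplied by the caller.
A = []
while len(A) < C_ops:
    # Candidates not yet selected that satisfy the operational constraints given A
    feasible = [k for k in channels if k not in A and is_feasible(k, A)]
    if not feasible:
        break
    # Pick the channel with the largest marginal score given the current portfolio
    k_star = max(feasible, key=lambda k: score(k, A))
    A.append(k_star)
# By submodularity: EIG(A_greedy) ≥ (1 - 1/e) · EIG(A_optimal)
```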
Practical implication: after about three experiments per cycle, additional ones add diminishing returns. The cycle naturally caps the test program at a sensible level - so any one quarter's experimental budget should be spent on the top 2-3 priorities, not spread thin.
Proof of submodularity
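Let $\theta$ be the vector of channel parameters (true ROIs), and let $Y_A = \{Y_k\}_{k \in A}$ denote the experimental outcomes for a set $A$ of channels. We write $\mathrm{EIG}(A) = I(\theta;\, Y_A)$ for the mutual information between the parameters and the outcomes of experiment set $A$. The claim is the diminishing returns property: for any $A \subseteq A'$ and any $k \notin A'$,
$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) \;\ge\; \mathrm{EIG}(A' \cup \{k\}) - \mathrm{EIG}(A').$$
Step 1: marginal gain equals conditional MI. By the chain rule of mutual information, $I(\theta;\, Y_{A \cup \{k\}}) = I(\theta;\, Y_A) + I(\theta;\, Y_k \mid Y_A)$, so the marginal gain from adding $k$ to set $A$ is exactly the conditional mutual information:
$$\mathrm{EIG}(A \cup \{k\}) - \mathrm{EIG}(A) = I(\theta;\, Y_k \mid Y_A).$$
Step 2: the key inequality. Write $B = A' \setminus A$, so $Y_{A'} = (Y_A, Y_B)$. Apply the chain rule to $I(\theta,\, Y_k;\; Y_B \mid Y_A)$ in two ways:
$$I(\theta,\, Y_k;\; Y_B \mid Y_A) = I(\theta;\, Y_B \mid Y_A) + I(Y_k;\, Y_B \mid \theta,\, Y_A) \tag{10c}$$
The second term vanishes because experimental outcomes for distinct channels are conditionally independent given $\theta$: once we know the true parameters, the outcome of experiment $k$ carries no information about the outcomes of experiments $B$. The same joint quantity expands the other way as:
$$I(\theta,\, Y_k;\; Y_B \mid Y_A) = I(Y_k;\, Y_B \mid Y_A) + I(\theta;\, Y_B \mid Y_A,\, Y_k) \tag{10d}$$
Equating (10c) and (10d) and dropping the non-negative $I(Y_k;\, Y_B \mid Y_A)$ term:
$$I(\theta;\, Y_B \mid Y_A) \;\ge\; I(\theta;\, Y_B \mid Y_A,\, Y_k).$$
Step 3: assemble the bound. Apply the chain rule once more to $I(\theta;\, Y_k,\, Y_B \mid Y_A)$:
$$I(\theta;\, Y_k,\, Y_B \mid Y_A) = I(\theta;\, Y_k \mid Y_A) + I(\theta;\, Y_B \mid Y_A,\, Y_k) = I(\theta;\, Y_B \mid Y_A) + I(\theta;\, Y_k \mid Y_A,\, Y_B).$$
Rearranging, and using $Y_{A'} = (Y_A, Y_B)$ together with the inequality from Step 2:
$$I(\theta;\, Y_k \mid Y_A) - I(\theta;\, Y_k \mid Y_{A'}) = I(\theta;\, Y_B \mid Y_A) - I(\theta;\, Y_B \mid Y_A,\, Y_k) \;\ge\; 0,$$
which, by Step 1, is the claim.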
Gaussian specialisation. In the conjugate Gaussian model, the marginal gain from adding channel $k$ to set $A$ takes the closed form: $$I(\theta_k;\, Y_k \mid Y_A) \;=\; \tfrac{1}{2}\log\!\left(1 + \frac{\sigma_{k|A}^2}{\sigma_{E}^2}\right)$$ where $\sigma_{k|A}^2$ is the posterior variance of $\theta_k$ after observing $Y_A$. Because each experiment in $A$ can only reduce $\sigma_{k|A}^2$, the marginal gain is non-increasing in $|A|$ — the diminishing-returns property made explicit. Under the Nemhauser–Wolsey–Fisher (1978) theorem, greedy selection therefore attains at least $(1-1/e) \approx 63\%$ of the optimal portfolio value.
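The Gaussian specialisation can be checked numerically. The sketch below applies the standard rank-one Gaussian update to a hypothetical three-channel prior covariance and shows that the marginal gain for channel 0 shrinks as experiments on the correlated channels 1 and 2 are observed first. The covariance matrix and noise variance are made up for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def observe(cov: Matrix, k: int, noise_var: float) -> Matrix:
    """Posterior covariance after a noisy observation of coordinate k."""
    denom = cov[k][k] + noise_var
    n = len(cov)
    return [[cov[i][j] - cov[i][k] * cov[k][j] / denom for j in range(n)]
            for i in range(n)]


def marginal_gain(cov: Matrix, k: int, noise_var: float) -> float:
    """0.5 * log(1 + current variance of theta_k / noise variance), in nats."""
    return 0.5 * math.log(1.0 + cov[k][k] / noise_var)


# Hypothetical prior covariance: channels 0 and 1 are strongly correlated.
prior_cov = [
    [0.36, 0.18, 0.05],
    [0.18, 0.25, 0.04],
    [0.05, 0.04, 0.16],
]
NOISE_VAR = 0.09  # assumed experiment noise variance, same for every channel

gains = [marginal_gain(prior_cov, 0, NOISE_VAR)]       # A = {}
cov_after_1 = observe(prior_cov, 1, NOISE_VAR)
gains.append(marginal_gain(cov_after_1, 0, NOISE_VAR))  # A = {1}
cov_after_12 = observe(cov_after_1, 2, NOISE_VAR)
gains.append(marginal_gain(cov_after_12, 0, NOISE_VAR))  # A = {1, 2}
print([round(g, 3) for g in gains])
```

If the prior covariance were diagonal, the three gains would be identical: with independent channels, experiments on other channels tell you nothing about channel 0, and the diminishing returns come entirely from posterior correlation.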
Experimental Design: From Channel Selection to Experimental Spec
Selecting which channel to test is a prioritization problem (EIG, EVOI, §3–4). Specifying how to run the test — which geos to treat, how much to spend, how long to run — is a distinct design problem. A geo-level Bayesian MMM provides the inputs needed to answer every design question rigorously, replacing heuristic defaults with posterior-derived quantities.
Quantities the MMM Provides
| Quantity | Symbol | Use in test design |
|---|---|---|
| Per-geo channel posterior | $\beta_{k,g} \mid \mu_{k,g},\,\sigma_{k,g}^2$ | Identifies geos with most prior uncertainty (treated-geo ranking) |
| Per-geo current spend | $x_{k,g}$ | Anchors the saturation-curve evaluation point |
| Per-geo saturation curve | $S_k(\cdot;\,h_{k,g},\kappa_{k,g})$ | Local slope determines the optimal spend increment |
| Per-geo residual variance | $\sigma_g^2$ | Plugs directly into the standard error formula |
| Within-geo serial correlation | $\rho_g$ | Determines how duration converts to precision |
| Geo random-effect posteriors | $\mathbf{u}_g$ | Mahalanobis distance for matched-pair construction |
| Cross-geo posterior correlations | $\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})$ | Spillover and contamination diagnostics |
Step 1 · Treated-Geo Selection by Information Yield
Not all geos contribute equally to resolving channel uncertainty. The information yield $I_{k,g}$ scores each geo by how much a test there would compress the posterior on $\beta_{k,g}$:
$$I_{k,g} = \frac{\sigma_{k,g}^2 \,\big[S_k'(x_{k,g})\big]^2}{\sigma_g^2}$$
The numerator is the product of prior uncertainty (wide posterior = more to learn) and saturation slope squared (steep curve = spend increment translates to large signal). The denominator is residual noise. A geo saturated for channel $k$ has a small $S_k'(x_{k,g})$ and therefore low yield regardless of how uncertain the posterior is. Rank geos descending by $I_{k,g}$ and select the top $G$.
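A sketch of the ranking just described: prior variance times squared saturation slope, divided by residual variance. We assume a Hill-type saturation curve $S(x) = x^h / (x^h + \kappa^h)$ purely for illustration; the geo names and all numbers are hypothetical.

```python
def hill_slope(x: float, h: float, kappa: float) -> float:
    """Derivative of the Hill curve x^h / (x^h + kappa^h) with respect to x."""
    return h * kappa ** h * x ** (h - 1.0) / (x ** h + kappa ** h) ** 2


def information_yield(sigma_beta: float, slope: float, sigma_g: float) -> float:
    """Prior variance times squared slope, divided by residual variance."""
    return sigma_beta ** 2 * slope ** 2 / sigma_g ** 2


# Hypothetical geos: geo_c is the most uncertain but sits deep in saturation.
geos = {
    "geo_a": {"sigma_beta": 0.30, "spend": 40.0, "sigma_g": 1.0},
    "geo_b": {"sigma_beta": 0.30, "spend": 80.0, "sigma_g": 1.0},
    "geo_c": {"sigma_beta": 0.50, "spend": 400.0, "sigma_g": 1.0},
}
H, KAPPA = 1.5, 100.0  # assumed shape and half-saturation parameters

yields = {
    name: information_yield(g["sigma_beta"], hill_slope(g["spend"], H, KAPPA), g["sigma_g"])
    for name, g in geos.items()
}
ranking = sorted(yields, key=yields.get, reverse=True)
print(ranking)
```

Here geo_c has the widest posterior but ranks last, because its slope at current spend is more than ten times smaller than the slopes of the other two geos.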
Step 2 · Matched-Pair Construction via Posterior Mahalanobis Distance
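For each selected treatment geo $g$, choose a control geo $g'$ that minimises posterior Mahalanobis distance over the geo random effects $\mathbf{u}_g$:
$$d^2(g, g') = (\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})^\top \boldsymbol{\Sigma}_u^{-1} (\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})$$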
where $\boldsymbol{\mu}_{u,g}$ is the posterior mean of geo $g$'s random-effect vector and $\boldsymbol{\Sigma}_u$ is the posterior covariance matrix of those effects. Matching on Mahalanobis distance over the full posterior — rather than on raw KPI correlation — accounts for parameter uncertainty and produces 20–40% lower residual variance in synthetic studies. Exclude any control candidate with cross-geo posterior correlation $|\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})| > 0.5$ (spillover screen; see Step 6).
Step 3 · Treatment Intensity from the Saturation Curve
The optimal spend increment for geo $g$ equates signal-to-noise across the panel:
where $\eta$ is target power, $G$ is the number of treatment geos, and $T$ is planned duration. The factor $S_k'(x_{k,g})^{-1}$ is the key innovation: a geo near saturation (small slope) requires a proportionally larger spend increment to produce the same signal as a sub-saturated geo. In practice $\Delta\mathrm{spend}_g^*$ varies 3–5× across geos for the same channel; applying a uniform percentage uplift systematically under-powers some geos and wastes budget in others.
Step 4 · Duration from Serial Correlation
Weekly KPI within a geo is serially correlated with AR(1) coefficient $\rho_g$, estimated from the MMM residuals. The corrected standard error for the difference-in-differences estimator is:
When $\rho_g = 0$ this reduces to the familiar $\sigma/\sqrt{T}$ form. The minimum duration $T^*$ for target power $1-\beta$ solves:
The implicit equation is solved by one-dimensional search. For empirically common values $\rho_g \in [0.2, 0.5]$, EIG flattens after 6–10 weeks — running longer accumulates cost with minimal additional information.
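To illustrate why EIG flattens with duration, the sketch below assumes the corrected standard error takes the form $\sigma_g\sqrt{(1+(T-1)\rho_g)/T}$, which matches the serial-correlation factor $\sqrt{1+(T-1)\rho_g}$ used in the simulation step and reduces to $\sigma_g/\sqrt{T}$ at $\rho_g = 0$. That form, the stopping tolerance, and the input values are our assumptions, not the page's exact design equations.

```python
import math


def did_se(sigma_g: float, rho: float, weeks: int) -> float:
    """Assumed serially-corrected SE of a T-week mean effect."""
    return sigma_g * math.sqrt((1.0 + (weeks - 1) * rho) / weeks)


def eig_bits(sigma_prior: float, sigma_noise: float) -> float:
    return 0.5 * math.log2(1.0 + (sigma_prior / sigma_noise) ** 2)


def weeks_until_flat(sigma_prior: float, sigma_g: float, rho: float,
                     min_gain_bits: float = 0.02, max_weeks: int = 52) -> int:
    """Smallest T at which one more week adds less than min_gain_bits."""
    for t in range(1, max_weeks):
        gain = (eig_bits(sigma_prior, did_se(sigma_g, rho, t + 1))
                - eig_bits(sigma_prior, did_se(sigma_g, rho, t)))
        if gain < min_gain_bits:
            return t
    return max_weeks


for rho in (0.2, 0.3, 0.5):
    t_flat = weeks_until_flat(sigma_prior=0.5, sigma_g=0.5, rho=rho)
    print(f"rho={rho}: EIG flattens after about {t_flat} weeks")
```

Under this form the standard error never falls below $\sigma_g\sqrt{\rho_g}$ however long the test runs, which is why stronger serial correlation brings the flattening point forward.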
Step 5 · Pre-Experiment Simulation
Before committing budget, validate the proposed design by using the fitted MMM as a forward simulator:
- Sample $\tilde\theta$ from the posterior $p(\theta \mid \mathcal{D})$.
- Generate simulated KPI for each geo under the proposed spend schedule, adding noise $\mathcal{N}(0,\sigma_g^2)$ scaled by the serial-correlation factor $\sqrt{1+(T-1)\rho_g}$.
- Fit the planned estimator (DiD, SCM, etc.) to the simulated data and record whether it reaches significance at $\alpha$.
- Repeat 500–2,000 times; the fraction of significant replicates is the simulated power.
Simulated power catches issues — skewed KPI distributions, heterogeneous geo variance, adstock contamination — that closed-form approximations miss. If simulated power falls short, adjust $\Delta\mathrm{spend}_g^*$, add geos, or extend duration and re-simulate.
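The simulation loop above can be sketched in a few lines. This is a heavily simplified stand-in, not a full MMM forward simulator: it assumes a linear response over the test range, a Gaussian posterior on ROI, identical noise in every geo, and a known-variance z-test on matched-pair differences in place of the planned DiD or SCM estimator. All parameter names and values are hypothetical.

```python
import math
import random


def simulated_power(roi_mean, roi_sd, spend_delta, sigma_g, rho, weeks,
                    n_pairs, reps=1000, z_crit=1.96, seed=0):
    """Fraction of simulated tests whose matched-pair z-statistic is significant."""
    rng = random.Random(seed)
    # SD of one geo's T-week mean KPI, inflated by the serial-correlation factor.
    geo_mean_sd = sigma_g * math.sqrt(1.0 + (weeks - 1) * rho) / math.sqrt(weeks)
    # SE of the average treated-minus-control difference across matched pairs.
    se = geo_mean_sd * math.sqrt(2.0 / n_pairs)
    hits = 0
    for _ in range(reps):
        theta = rng.gauss(roi_mean, roi_sd)  # draw an ROI from the posterior
        true_lift = theta * spend_delta      # assumed linear response
        diffs = [true_lift + rng.gauss(0.0, geo_mean_sd) - rng.gauss(0.0, geo_mean_sd)
                 for _ in range(n_pairs)]
        estimate = sum(diffs) / n_pairs
        hits += abs(estimate / se) > z_crit
    return hits / reps


power = simulated_power(roi_mean=1.5, roi_sd=0.2, spend_delta=0.5, sigma_g=1.0,
                        rho=0.3, weeks=8, n_pairs=6)
print(f"simulated power: {power:.2f}")
```

Setting `roi_mean` and `roi_sd` to zero is a useful sanity check: the returned rate should sit near the nominal false-positive rate of the test.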
Step 6 · Spillover and Contamination Diagnostics
Cross-geo spillover (national TV halo, retargeting bleed across DMAs) deflates the estimated lift by partially treating the control. Diagnose contamination with the cross-geo posterior correlation:
$$r_{g,g'} = \mathrm{Cor}(\beta_{k,g},\, \beta_{k,g'})$$
Flag any control candidate with $|r_{g,g'}| > 0.5$ and exclude it from the donor pool. Adstock carryover introduces time-domain contamination in the first 1–2 weeks; include a wash-in buffer or model carryover explicitly in the geo-level analysis. Both diagnostics are computed from the fitted MMM posterior — no additional data collection is required.
Worked Example — CTV Campaign
Five candidate geos were scored for a Connected-TV channel. The table summarises the design output of Steps 1–6.
| Geo | Role | $I_{k,g}$ rank | Saturation slope $S'$ | $\Delta\mathrm{spend}_g^*$ | $T^*$ (weeks) |
|---|---|---|---|---|---|
| Atlanta | Treatment | 1st | 0.82 (sub-saturated) | +18% | 7 |
| Phoenix | Treatment | 2nd | 0.51 (mid-saturated) | +31% | 9 |
| Tampa | Control (Atlanta) | — | $d^2 = 0.9$ | 0% | — |
| Portland | Control (Phoenix) | — | $d^2 = 1.4$ | 0% | — |
| St. Louis | Excluded | 3rd | $r_{g,g'} = 0.71$ | — | — |
Atlanta's lower $\Delta\mathrm{spend}^*$ (18% vs 31%) reflects its steeper saturation slope: the same ROI uncertainty is resolved more cheaply where the response curve is still responsive. St. Louis was excluded despite ranking 3rd by $I_{k,g}$ because its cross-geo posterior correlation with Atlanta ($r = 0.71$) exceeded the spillover threshold.
Calibration: Using Experiments to Anchor the Next MMM
Calibration is the process of using experimental causal estimates to anchor the MMM's channel parameters so that model-implied ROIs are consistent with ground truth. This can range from soft regularization (Bayesian priors informed by experiments) to hard constraints (fixing parameters at experimental point estimates - which we generally avoid). Three mechanisms cover the main use cases.
Before calibration
- Parameters identified by observational variation alone
- Posteriors driven mostly by structural assumptions
- Attribution may reflect media collinearity, not causality
- ROI rankings unstable across model specifications
After calibration
- Experimental likelihoods constrain key channel parameters
- Posteriors tighter on tested channels; partial pooling propagates
- Attribution anchored to causal reality for tested channels
- Remaining uncertainty quantified, not hidden
Last quarter's posterior becomes this quarter's prior
The most principled approach: use the experimental posterior as the prior on the corresponding MMM channel parameter. The MMM likelihood pulls unconstrained channels in fitting; the informed prior anchors the calibrated channel near its experimentally-identified value.
Treat the experiment as one more data point
Add the experimental estimate as an additional term in the joint likelihood. Cleaner when the experiment overlaps the MMM training window. Requires a mapping $g(\theta_k)$ from MMM parameters to the implied geo-lift effect size.
Tested channels pull untested siblings via shared structure
Information from a tested channel propagates to untested channels via a shared category-level hyperparameter. An experiment on CTV updates $\mu_{\text{video}}$, which partially informs the OLV estimate.
Mechanism 1 - Soft calibration via Bayesian priors
The experiment on channel $k$ produces $p(\tau_k \mid \hat{y}_k)$. Pass that distribution as the prior on the corresponding MMM parameter:
The scale transform matters. A geo-lift $\hat{\tau}$ is in outcome units per spend-dollar in the test window, while an MMM coefficient may be in standardized units. Convert carefully and document the conversion rule in the calibration spec.
Mechanism 2 - Likelihood augmentation
$$\hat{y}_k \mid \theta_k \sim \mathcal{N}\!\big(g(\theta_k),\, \sigma_{\text{exp},k}^2\big)$$
where $g(\theta_k)$ maps MMM channel parameters to the implied geo-lift effect size - accounting for the spend delta in the test, the saturation curve evaluated at the test spend level, and any adstock effects in the experiment window. Constructing $g(\cdot)$ correctly is the hardest part of this mechanism; miscalibration introduces systematic bias.
Mechanism 3 - Hierarchical pooling
$$\theta_k \sim \mathcal{N}\!\big(\mu_{\mathrm{cat}[k]},\, \sigma_{\text{cat}}^2\big)$$
where $\mathrm{cat}[k]$ is the media category of channel $k$ (e.g., video, search, social). An experiment on CTV updates $\mu_{\text{video}}$, which partially informs the OLV estimate. Pooling strength is controlled by $\sigma_{\text{cat}}$ via a hyperprior - tighter $\sigma_{\text{cat}}$ pools more strongly.
Calibration validity checks
- Scale consistency - verify that the experimental $\hat{\tau}_k$ and the MMM-implied effect use the same outcome variable and time aggregation.
- Coverage check - confirm that the MMM prior 95% CI for channel $k$ covers the experimental point estimate. If not, diagnose model misspecification before calibrating.
- Quality gate - only use experiments that pass pre-specified quality checks (balance test, pre-period MAPE, sufficient MDE, no contamination evidence) for calibration inputs.
- Attribution reconciliation - after refitting, confirm total attributed conversions still match observed outcomes - calibration should not break aggregate coherence.
- Predictive validation - hold out a test period and compare pre- vs. post-calibration MAPE. Calibration should improve predictive accuracy or at minimum not degrade it.
When experiment and MMM disagree
If the experimental estimate falls outside the MMM posterior's plausible range, do not automatically trust either side. Diagnose first: (1) Did the experiment have sufficient power? (2) Was there contamination or geo spillover? (3) Is the MMM response curve evaluated at the correct spend level? (4) Are there temporal confounders in the experiment window? Work through this checklist before resolving the conflict. The disagreement itself is documented evidence of model fragility - which is information you want.
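Two quick numerical screens can sit in front of that checklist: the coverage check from the validity list, and a z-score for the gap that accounts for both sources of uncertainty. The sketch below is one simple way to compute them under Gaussian assumptions. The point estimates echo the CTV example at the top of this page; the standard deviations are hypothetical.

```python
import math


def covers(mu_mmm: float, sigma_mmm: float, y_hat: float, z: float = 1.96) -> bool:
    """True if the MMM's central interval contains the experimental point estimate."""
    return abs(y_hat - mu_mmm) <= z * sigma_mmm


def disagreement_z(mu_mmm: float, sigma_mmm: float, y_hat: float, se_exp: float) -> float:
    """Gap between experiment and MMM, in units of their combined SD."""
    return (y_hat - mu_mmm) / math.hypot(sigma_mmm, se_exp)


# MMM: ROI 1.4 with SD 0.2 (assumed). Experiment: ROI 2.1 with SE 0.4 (assumed).
inside = covers(1.4, 0.2, 2.1)
z_gap = disagreement_z(1.4, 0.2, 2.1, 0.4)
print(f"covered by MMM interval: {inside}; gap z-score: {z_gap:.2f}")
```

In this illustration the coverage check fails while the combined z-score is only about 1.6. Part of the apparent conflict is experimental noise, which is a reason to check power before concluding that the MMM is misspecified.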
The Virtuous Cycle: Six-Step Orbit
The full adaptive loop treats MMM calibration as a sequential Bayesian inference problem across measurement cycles. Each cycle has six steps - three modeling steps that happen against historical data, and three field steps that happen in the market.
Fit baseline MMM
Run the Bayesian MMM on historical data with weakly informative priors. Extract posterior $p(\boldsymbol{\theta})$ over all channel ROI parameters. Flag channels with $\sigma_k$ above threshold.
Score EIG & EVOI
Use the T0 posterior as the prior. Estimate $\sigma_{\text{exp}}$ per channel given geo footprint. Produce the priority grid and select the top-K portfolio subject to operational constraints.
Run experiments (pre-registered)
Execute the selected geo-lift / matched-market tests with pre-specified designs. Log all deviations. Apply ITT analysis by default. Produce $p(\tau_k \mid \hat{y}_k)$ per tested channel.
Calibrate the next MMM
Refit via informed priors (M1) or augmented likelihood (M2). Document the degree of belief revision: pre- vs post-calibration mean and width per channel.
Allocate from the calibrated posterior
Make budget decisions from the calibrated MMM. Tag each line as experiment-backed or model-only. Report confidence tiers - distinguish evidence quality from point estimates.
Re-score for next cycle
Calibrated channels now have tighter posteriors - their EIG and EVOI drop. Recompute the priority grid with updated beliefs. Previously deprioritized channels may now rise. Begin again.
The single-sentence summary
The MMM tells us where to look; the experiments tell us what's actually there; the next MMM bakes in what we learned. Each loop tightens the parts of the picture that matter most for budget decisions.
Information decay & re-experimentation triggers
Experimental calibration has a shelf life. An experiment from 18 months ago, run under different competitive conditions, spend levels, or creative strategies, may no longer accurately represent current channel effectiveness. Model an exponential decay of experimental information over time:
where $\lambda_{\text{decay}}$ is calibrated from observed year-over-year MMM coefficient stability. As $\sigma_{k,\text{eff}}^2$ grows, EIG and EVOI for re-experimentation rise again - creating a principled re-experimentation schedule rather than relying on calendar-based intuitions. Decay rates differ by channel: fast-moving digital decays in 6-12 months; stable broadcast can hold 18-24.
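A sketch of how a decay model translates into a re-test date. We assume the simplest exponential form, with effective variance equal to the post-calibration variance times $e^{\lambda_{\text{decay}} t}$, and trigger a re-test once the Gaussian EIG of a fresh experiment reaches a chosen number of bits. The functional form, the trigger level, and the inputs are assumptions for illustration.

```python
import math


def effective_variance(sigma_post: float, lam_decay: float, months: float) -> float:
    """Assumed exponential inflation of the calibrated posterior variance."""
    return sigma_post ** 2 * math.exp(lam_decay * months)


def months_until_retest(sigma_post: float, sigma_exp: float, lam_decay: float,
                        min_bits: float = 0.5) -> float:
    """Months until a fresh experiment with noise SD sigma_exp is worth min_bits."""
    # EIG = 0.5 * log2(1 + ratio) >= min_bits  <=>  ratio >= 2^(2 * min_bits) - 1
    needed_ratio = 2.0 ** (2.0 * min_bits) - 1.0
    months = math.log(needed_ratio * sigma_exp ** 2 / sigma_post ** 2) / lam_decay
    return max(0.0, months)


# Hypothetical channel: calibrated SD 0.15, achievable experiment SD 0.30,
# decay rate 0.08 per month.
wait = months_until_retest(sigma_post=0.15, sigma_exp=0.30, lam_decay=0.08)
print(f"re-test after about {wait:.1f} months")
```

A faster decay rate or a more precise achievable experiment both shorten the wait, which is how channel-specific re-test schedules fall out of the same rule.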
Adaptive stopping criteria
A channel exits the active experimentation pool when all three conditions hold:
$T_{\text{refresh}}$ depends on media market dynamics, creative/targeting shifts, and competitive changes. Typical values: 12-24 months for stable channels, 6-12 months for fast-moving digital surfaces.
Why It Compounds
The point of running this loop is not any single experiment - it is the compounding of belief updates over successive cycles. A useful way to see the value is to plot the trajectories of four quantities that each cycle is trying to move:
- MMM posterior precision - the 95% credible interval width on each channel's ROI. As experiments calibrate channels, widths contract.
- Budget allocation - the share of total spend going to each channel. As ROIs become better-identified, allocations migrate toward the (unknown but increasingly well-estimated) optimum.
- Misallocation cost - the gap between current spend and the spend that would maximize portfolio outcome under the current posterior. The running tab of wasted dollars per week.
- Decision efficiency - the fraction of the theoretically optimal portfolio return that is actually captured, averaged over the current posterior. Rises as sigma shrinks and allocation converges to the optimum.
For a representative five-channel portfolio over five quarterly cycles, the loop produces the kind of trajectory below. The exact magnitudes depend on starting MMM uncertainty and spend asymmetry, but the shape - sharp early gains followed by diminishing returns - is structural.
How to read the four panels
Posterior contraction is the epistemic outcome — what we know better. Allocation migration is the operational outcome — what we do differently as a result. Misallocation cost is the economic outcome — what poor knowledge was costing us. Decision efficiency is the strategic outcome — the share of the theoretically optimal return we are actually capturing. A defensible measurement program connects all four and reports them quarterly.
Frequentist Tools, Bayesian Framing
Most geo-lift estimators in active production use are frequentist by construction: difference-in-differences with two-way fixed effects, synthetic control, augmented synthetic control, and time-based regression. Stakeholders speak that language. Pre-registration documents demand it. None of this is in tension with the Bayesian framework above.
The mental model
Run frequentist estimators because they are well-understood, defensible, and pre-registrable. Interpret their outputs as Gaussian likelihoods feeding into the conjugate update of Eq. (2). Stakeholders see CIs and p-values; the MMM team sees a precision-weighted posterior. Both are correct views of the same numbers.
The Bridge Equation
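A frequentist point estimate $\hat{\beta}_k$ paired with a standard error $\mathrm{SE}_k$ is - under mild assumptions - a Gaussian likelihood. That likelihood plugs straight into the conjugate update alongside the MMM prior $\theta_k \sim \mathcal{N}(\mu_0, \sigma_0^2)$:
$$\hat{\beta}_k \mid \theta_k \sim \mathcal{N}(\theta_k,\, \mathrm{SE}_k^2) \;\;\Longrightarrow\;\; \theta_k \mid \hat{\beta}_k \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad \frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\mathrm{SE}_k^2}, \qquad \mu_1 = \sigma_1^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{\beta}_k}{\mathrm{SE}_k^2}\right) \tag{11}$$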
The translation table is short: estimator output $\hat{\beta}_k$ becomes the likelihood mean; $\mathrm{SE}_k^2$ becomes the likelihood variance $\sigma_e^2$; the MDE relates to the achievable $\sigma_{\text{exp}}$ through the design's power calculation. Every frequentist estimator the measurement team already runs - DiD, SCM, ASC, time-based regression, BSTS / CausalImpact - feeds the same loop without anyone switching teams.
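In practice, past readouts often report a confidence interval rather than a standard error. A minimal sketch of the translation, assuming a symmetric normal-theory 95% interval (asymmetric or bootstrap intervals would need a different conversion); the readout numbers and MMM prior are hypothetical.

```python
from typing import Tuple


def likelihood_from_ci(ci_low: float, ci_high: float, z: float = 1.96) -> Tuple[float, float]:
    """Recover (point estimate, SE) from a symmetric normal-theory interval."""
    beta_hat = 0.5 * (ci_low + ci_high)
    se = (ci_high - ci_low) / (2.0 * z)
    return beta_hat, se


def bridge_update(mu_mmm: float, sigma_mmm: float, beta_hat: float, se: float) -> Tuple[float, float]:
    """Conjugate update with the frequentist readout as a Gaussian likelihood."""
    precision = 1.0 / sigma_mmm ** 2 + 1.0 / se ** 2
    mean = (mu_mmm / sigma_mmm ** 2 + beta_hat / se ** 2) / precision
    return mean, precision ** -0.5


# Hypothetical DiD readout: 95% CI for ROI of [1.316, 2.884].
beta_hat, se = likelihood_from_ci(1.316, 2.884)
post_mean, post_sd = bridge_update(mu_mmm=1.4, sigma_mmm=0.2, beta_hat=beta_hat, se=se)
print(f"likelihood: {beta_hat:.2f} +/- {se:.2f}; posterior: {post_mean:.2f} +/- {post_sd:.3f}")
```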
When the bridge isn't enough - small samples, multiple competing estimators, sequential monitoring, or borrowing strength across markets - go fully Bayesian on the experiment side too. Tools include BEST (robust Bayesian estimation), HDI + ROPE for decision-theoretic readouts, hierarchical / multilevel models, Bayesian synthetic control, and sequential Bayesian inference (no alpha-spending required). These all output posteriors that plug into Eq. (11) directly.
Implementation: A 90-Day Rollout
A staged rollout with explicit decision gates. Phase 1 is a 90-day pilot that delivers decision-grade evidence about whether to continue. The steady-state quarterly cadence is what comes after.
- Audit existing MMM & geo-lift artifacts
- Build Bayesian MMM in parallel
- Posterior predictive checks pass
- Side-by-side OLS vs. Bayes readout
- Compute first EIG/EVOI grid
- Produce 2x2 priority map
- Stakeholder review & selection
- Pre-register top 2-3 experiments
- Run pre-registered experiments
- Refit MMM with calibration priors
- Pre/post allocation comparison
- Phase 1 go/no-go gate
- Quarterly priority + calibration cycle
- OLS pipeline retired after 2 cycles
- Re-experimentation triggers fire
- Confidence-tier reporting standard
Prerequisites
- Geo-week panel data - 3+ years of the outcome metric (revenue, conversions, sessions) at the geo grain. Daily is fine; the model will aggregate.
- Per-channel media spend at the same geo grain or with a documented allocation rule from national to geo.
- Pre-period MAPE benchmarks from the existing MMM, by channel and aggregate. Needed to credibly compare the new pipeline.
- Past geo-lift readouts with point estimates and standard errors. These are the calibration inputs for the first cycle. Verdicts ("CTV worked") are not enough - we need the underlying numbers.
- Tooling: PyMC ≥ 5 or NumPyro for the Bayesian MMM and hierarchical experiment models; ArviZ for diagnostics; pyfixest or fixest (R) for the frequentist DiD baseline; JAX optional but recommended for speed.
Success Metrics
| Metric | Definition | Target | Cadence |
|---|---|---|---|
| Holdout MAPE | Out-of-sample predictive accuracy of the calibrated MMM | ≤ OLS baseline | Each cycle |
| Posterior contraction | Average reduction in $\sigma_k$ for tested channels post-calibration | ≥ 30% | Each cycle |
| Misallocation Δ | Weekly misallocation cost vs. last cycle | Falling, then flat | Each cycle |
| Portfolio mROI | Marginal return of the next dollar across the portfolio | Rising over 4 cycles | Annual review |
| Calibration coverage | Fraction of spend in experiment-backed channels (vs. model-only) | ≥ 60% by Year 1 | Each cycle |
| Stakeholder fluency | Sponsor and planning team can explain priority map & confidence tiers in their own words | Yes by Cycle 3 | Cycle 3 review |
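Three of these metrics reduce to short calculations. The sketch below is one possible reading of the definitions, not a specification: it treats posterior contraction as the average relative reduction in $\sigma_k$, and it weights calibration coverage by spend. Function names, channel names, and all numbers are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error on a holdout window (as a fraction)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def posterior_contraction(sd_before, sd_after):
    """Average relative reduction in posterior sd across tested channels."""
    reductions = [1.0 - sd_after[ch] / sd_before[ch] for ch in sd_before]
    return sum(reductions) / len(reductions)


def calibration_coverage(spend, experiment_backed):
    """Share of total spend sitting in channels with an experiment behind them."""
    total = sum(spend.values())
    return sum(v for ch, v in spend.items() if ch in experiment_backed) / total


# Illustrative inputs (assumed values).
holdout_mape = mape(actual=[100.0, 200.0], predicted=[110.0, 190.0])
contraction = posterior_contraction(
    sd_before={"ctv": 0.5, "search": 0.4},
    sd_after={"ctv": 0.3, "search": 0.3},
)
coverage = calibration_coverage(
    spend={"ctv": 300.0, "search": 500.0, "audio": 200.0},
    experiment_backed={"ctv", "search"},
)
print(f"MAPE={holdout_mape:.3f} contraction={contraction:.3f} coverage={coverage:.2f}")
```

Whichever definitions the team settles on, fixing them in code before the first cycle keeps the targets in the table comparable from one cycle to the next.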
Anti-Patterns & Failure Modes
Six failure modes seen in similar programs, called out up front so we recognize them when they happen.
- "Run all the experiments we can think of." Submodularity says marginal information per experiment falls fast. Running 8 experiments to learn what 3 would have taught you is wasted budget. The priority engine is supposed to cap the program; treating it as an unconstrained idea-generator defeats the point.
- "Calibrate every channel, every cycle." Calibration has a shelf life and a cost. Re-experimenting on a channel whose last test is 4 months old and stable is just lighting test budget on fire. Use the re-experimentation trigger; it exists for this reason.
- "Use a flat prior so we're not biasing the answer." A flat prior reproduces the OLS pathology. Priors are how you regularize; refusing to use informative priors is a refusal to do the regularization work. Defensible > flat.
- "The Bayesian MMM is wider than OLS, so it's worse." The Bayesian MMM is wider because the OLS interval was lying about its precision. Width is information, not weakness. Reporting templates need to make this distinction explicit.
- "Let's just plug the experimental point estimate in as a fixed parameter." Hard calibration by point-substitution discards the experimental SE and produces over-confidence in the calibrated channel. Use the soft prior (M1) or augmented likelihood (M2). Never the point.
- "The model said it; I just report what comes out." The Bayesian MMM "says" what your priors and likelihood structure say. The analyst owns those choices; reporting templates should attribute structural decisions to the analyst, not to "the model."
The most insidious failure mode
Stakeholders accept the new framework but quietly continue making allocation decisions on instinct, treating the priority map as a research artifact rather than a budget recommendation. This is invisible from the inside - the team produces the deliverables, gets compliments, and nothing gets used. Detection: at the second-cycle review, ask the sponsor to point at a specific budget reallocation that happened because of a calibrated posterior. If they can't, the program isn't actually running yet - only the appearance of it is.
Closing Principle
The MMM and geo-lift experiments are not competing measurement paradigms - they are complementary nodes in a Bayesian inference graph. The MMM provides a coherent joint model of all channels with full coverage; experiments provide local causal identification for high-priority channels. Information flows are bidirectional: the MMM shapes experiment design (via EIG/EVOI prioritization), and experiments calibrate the MMM (via informed priors or augmented likelihoods). Over successive cycles, this adaptive loop systematically contracts the uncertainty that actually matters for budget decisions.
Where to go next
To understand the math foundations in more depth, see the Bayesian Workflow and Causal Inference guides. To see how the MMM is implemented, see the Modeling Guide and the Technical Guide. To see what the framework produces, see the Demos & Reports.