Calibration & Decision Theory Workshop

§ 1 · Setting the frame

The Decision Problem

An MMM is not the deliverable. The budget allocation is. Frame the math accordingly.

A decision problem in classical decision theory has four parts. Every measurement question eventually has to answer all of them, even if implicitly:

State space & action space

State $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ is the unknown channel-level ROI vector. Action $\mathbf{b} = (b_1, \ldots, b_K)$ is the budget allocation we choose, subject to $\sum_k b_k = B$.

Utility & constraints

Utility $U(\mathbf{b}, \boldsymbol{\theta}) = \sum_k \theta_k\, f(b_k)$ with concave response $f(\cdot)$. Constraints include channel floors, ops capacity for tests, calendar windows, and creative readiness.

The choice we actually want to make is the budget that maximises expected utility under our current beliefs. With perfect knowledge that's $\mathbf{b}^*(\boldsymbol{\theta}) = \arg\max_{\mathbf{b}} U(\mathbf{b}, \boldsymbol{\theta})$. With uncertainty it's $\mathbf{b}^*(\bar{\boldsymbol{\theta}})$, where $\bar{\boldsymbol{\theta}}$ is the posterior mean (because $U$ is linear in $\boldsymbol{\theta}$, expected utility depends on the posterior only through its mean). The value of that decision still degrades with uncertainty, in ways most teams don't quantify.

1.1 · Risk preference and certainty equivalents

Even with the same expected ROI, two strategies can be evaluated very differently depending on risk preference. CRRA (constant relative risk aversion) utility $u(x) = x^{1-\rho}/(1-\rho)$ captures this with one parameter. Drag $\rho$ to see the gap between expected value and certainty equivalent open up.

Risk aversion ρ 0.50

Outcome A $1.40

Outcome B $1.00

Expected return

$1.20

Certainty equivalent

$1.19

Risk premium

$0.01

Why this matters for MMMs

A wider credible interval is not just "fuzzy precision": under any strictly concave utility it is a real, measurable cost. The certainty equivalent of a noisy ROI estimate is strictly less than its mean. That gap is what an experimental program is buying down.
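A minimal sketch of the certainty-equivalent calculation, assuming the two outcomes are equally likely (the 50/50 odds are inferred from the $1.20 expected return, and the function names are illustrative rather than taken from the workshop's own code):

```python
import math

def crra_utility(x: float, rho: float) -> float:
    """CRRA utility u(x) = x^(1-rho) / (1-rho); log utility in the rho -> 1 limit."""
    if x <= 0:
        raise ValueError("CRRA utility is defined for positive outcomes only")
    if math.isclose(rho, 1.0):
        return math.log(x)
    return x ** (1.0 - rho) / (1.0 - rho)

def certainty_equivalent(outcomes, probs, rho):
    """Sure amount whose utility equals the expected utility of the lottery."""
    expected_u = sum(p * crra_utility(x, rho) for x, p in zip(outcomes, probs))
    if math.isclose(rho, 1.0):
        return math.exp(expected_u)
    # Invert u(x) = x^(1-rho) / (1-rho).
    return ((1.0 - rho) * expected_u) ** (1.0 / (1.0 - rho))

outcomes, probs = [1.40, 1.00], [0.5, 0.5]  # assumed 50/50 lottery
mean = sum(p * x for x, p in zip(outcomes, probs))
for rho in (0.0, 0.5, 2.0, 5.0):
    ce = certainty_equivalent(outcomes, probs, rho)
    print(f"rho={rho:.1f}  mean={mean:.3f}  CE={ce:.4f}  premium={mean - ce:.4f}")
```

At $\rho = 0$ the certainty equivalent equals the mean; the premium opens up as $\rho$ grows, which is the gap the slider illustrates.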

§ 2 · The dollar cost of uncertainty

Why Tight CIs Aren't the Same as Right Answers

A confidently-wrong MMM can be worse than no model at all — you pay to build it, then commit the wrong dollars before anyone notices.

Frame misallocation cost as the gap between the value we'd produce knowing $\theta$ exactly and the value we actually produce optimising against our point estimate $\hat\theta$. Under a concave response curve and a budget constraint, that gap grows with posterior variance even when the posterior mean is correct on average.

\text{Misallocation cost} \;=\; \mathbb{E}_{\boldsymbol{\theta}}\!\left[ U(\mathbf{b}^*(\boldsymbol{\theta}), \boldsymbol{\theta}) - U(\mathbf{b}^*(\hat{\boldsymbol{\theta}}), \boldsymbol{\theta}) \right]

2.1 · The cost of a wider posterior

A two-channel portfolio with budget $B = \$1\text{M}$. True ROI is identical for both channels (so the optimal split is 50/50). The model's point estimate says one channel is better — by an amount drawn from the posterior. The wider that posterior, the further we drift from optimum on average. Drag the σ slider to see expected misallocation cost rise.

Posterior σ on each ROI 0.30

Saturation steepness k 0.70

Expected misalloc cost / wk

$48K

% of budget wasted

4.8%

Annualised

$2.5M

The hidden compounding

Misallocation cost scales quadratically with posterior σ when the response curve is smooth and strictly concave near the optimum: the first-order loss vanishes at the optimum, so the leading term is second-order in the allocation error. Halving σ doesn't halve the cost; it cuts it to roughly a quarter. This is why moving even one channel from "model-only" to "experiment-backed" pays back so quickly.
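The quadratic scaling can be checked with a small Monte Carlo sketch. It assumes a square-root response $f(b) = \sqrt{b}$ (chosen only because the optimal split then has a closed form; it is not the calculator's saturation curve), a budget normalised to 1, and a true ROI of 1.5 on both channels. All names and numbers are hypothetical.

```python
import math
import random

def optimal_split(theta1, theta2, budget):
    """Maximiser of theta1*sqrt(b1) + theta2*sqrt(b2) subject to b1 + b2 = budget.

    For the assumed response f(b) = sqrt(b), the first-order conditions give
    b_i proportional to theta_i squared.
    """
    w = theta1 ** 2 / (theta1 ** 2 + theta2 ** 2)
    return w * budget, (1.0 - w) * budget

def utility(b1, b2, theta1, theta2):
    return theta1 * math.sqrt(b1) + theta2 * math.sqrt(b2)

def expected_misallocation(sigma, true_roi=1.5, budget=1.0, n_draws=20_000, seed=0):
    """Average utility lost by optimising against a noisy point estimate."""
    rng = random.Random(seed)
    best = utility(*optimal_split(true_roi, true_roi, budget), true_roi, true_roi)
    loss = 0.0
    for _ in range(n_draws):
        # Point estimates: truth plus independent noise of size sigma, kept positive.
        est1 = max(1e-6, true_roi + rng.gauss(0.0, sigma))
        est2 = max(1e-6, true_roi + rng.gauss(0.0, sigma))
        b1, b2 = optimal_split(est1, est2, budget)
        # Evaluate the chosen split against the TRUE ROIs.
        loss += best - utility(b1, b2, true_roi, true_roi)
    return loss / n_draws

wide, narrow = expected_misallocation(0.30), expected_misallocation(0.15)
print(f"loss at sigma=0.30: {wide:.5f}")
print(f"loss at sigma=0.15: {narrow:.5f}")
print(f"ratio: {wide / narrow:.2f}")
```

Under these assumptions the ratio should land close to four, which is the "halve σ, quarter the cost" behaviour described above.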

§ 3 · The mechanics

Calibration Is a Conjugate Update

No magic. The MMM is the prior, the experiment is the likelihood, the next MMM uses the posterior.

Under the standard Gaussian-Gaussian setup, the posterior after one experiment is closed-form: precisions add, means are precision-weighted. This is the workhorse formula behind every "soft calibration" mechanism in the framework.

\sigma_{\text{post}}^{-2} \;=\; \sigma_0^{-2} + \sigma_e^{-2}, \qquad \mu_{\text{post}} \;=\; \sigma_{\text{post}}^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{\hat{y}}{\sigma_e^2}\right)
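The update is short enough to write out directly. This is a sketch of the formula above with illustrative function and variable names; the inputs are the default slider values from the demo in 3.1.

```python
def gaussian_update(mu_prior, sigma_prior, y_exp, sigma_exp):
    """Gaussian-Gaussian conjugate update: precisions add, means are precision-weighted."""
    prec_prior = 1.0 / sigma_prior ** 2
    prec_exp = 1.0 / sigma_exp ** 2
    prec_post = prec_prior + prec_exp
    weight_exp = prec_exp / prec_post          # share of the posterior mean from the experiment
    mu_post = (1.0 - weight_exp) * mu_prior + weight_exp * y_exp
    sigma_post = prec_post ** -0.5
    return mu_post, sigma_post, weight_exp

mu_post, sigma_post, weight_exp = gaussian_update(
    mu_prior=1.40, sigma_prior=0.40, y_exp=2.10, sigma_exp=0.25
)
print(f"posterior mean  = {mu_post:.2f}")
print(f"posterior sigma = {sigma_post:.2f}")
print(f"prior weight    = {1 - weight_exp:.0%}, experiment weight = {weight_exp:.0%}")
```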

3.1 · Posterior = prior × likelihood, weighted by precision

Set the MMM prior (sage) and a candidate experimental result (gold). The posterior (forest) is computed live. Notice that the posterior mean always sits between the prior mean and the experimental estimate, and it leans hard toward whichever distribution is narrower. A precise experiment overwhelms a vague prior; a confident prior pushes back against a noisy experiment.

Prior μ₀ (MMM mean) 1.40

Prior σ₀ (MMM σ) 0.40

Experiment ŷ (mean) 2.10

Experiment σ_e 0.25

Posterior μ

1.90

Posterior σ

0.21

Prior weight

28%

Experiment weight

72%

Three calibration mechanisms, one math

M1 (informed prior) applies this update between MMM fits. M2 (likelihood augmentation) folds the experimental term into one joint posterior. M3 (hierarchical pooling) lets a tested channel pull untested siblings via a shared category-level mean. All three reduce, in the Gaussian limit, to versions of the formula above.

§ 4 · What we'd learn

Expected Information Gain

A scalar, in bits, that ranks experiments by how much they'd shrink our beliefs.

EIG is the expected reduction in posterior entropy from running an experiment. For Gaussian conjugate updates it has a clean closed form depending only on the prior-to-experiment variance ratio:

\mathrm{EIG}(k, d) \;=\; \tfrac{1}{2}\log_2\!\left(1 + \frac{\sigma_k^2}{\sigma_{\text{exp},k}^2(d)}\right) \quad \text{bits}
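A short sketch of the closed form, using illustrative function names. It also checks a useful identity that follows from the conjugate update: EIG in bits equals $\log_2(\sigma_k / \sigma_{\text{post}})$, so one bit corresponds to halving the posterior standard deviation.

```python
import math

def eig_bits(sigma_prior, sigma_exp):
    """Expected information gain (bits) for a Gaussian conjugate update."""
    return 0.5 * math.log2(1.0 + (sigma_prior / sigma_exp) ** 2)

def posterior_sigma(sigma_prior, sigma_exp):
    """Posterior standard deviation after one experiment (precisions add)."""
    return (sigma_prior ** -2 + sigma_exp ** -2) ** -0.5

sigma_k, sigma_exp = 0.40, 0.20
post = posterior_sigma(sigma_k, sigma_exp)
print(f"EIG            = {eig_bits(sigma_k, sigma_exp):.2f} bits")
print(f"log2 shrinkage = {math.log2(sigma_k / post):.2f} bits")
print(f"posterior sd   = {post:.2f}, CI shrinkage = {1 - post / sigma_k:.0%}")

# Diminishing returns: EIG at successive quadruplings of the variance ratio.
for variance_ratio in (1, 4, 16, 64):
    bits = 0.5 * math.log2(1 + variance_ratio)
    print(f"variance ratio {variance_ratio:>2}x -> {bits:.2f} bits")
```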

4.1 · The EIG landscape

The heatmap shows EIG over the (prior σ, experiment σ) plane. Channels in the upper-left — loose prior, precise experiment — are where information gain is maximised. Drag the markers below to position your channel and read off its EIG.

Channel prior σ_k 0.40

Achievable σ_exp 0.20

EIG (bits)

1.16

σ-ratio (prior / exp)

2.00×

Posterior σ (post-test)

0.18

CI shrinkage

55%

The headline shape of the formula

EIG is concave in the variance ratio. Going from a 1× ratio to 4× buys about 0.66 bits; going from 4× to 16× buys about 0.88 more, and no further quadrupling can ever buy more than one bit. Diminishing returns are baked in. The right experiment for a fuzzy channel is one where you can hit a moderate variance ratio, not the most precise possible: you're paying linear cost for logarithmic gain.

§ 5 · What that learning is worth

Expected Value of Information

Bits don't pay the rent. Translate uncertainty reduction into dollars before prioritising.

EVOI prices the information from an experiment by the expected dollar improvement in the budget decision that follows. The linearised approximation makes the structure transparent:

\mathrm{EVOI}(k) \;\approx\; B \cdot s_k \cdot \mathbb{E}\!\left[\,\big|\hat{\theta}_{k,\text{post}} - \hat{\theta}_{k,\text{prior}}\big|\,\right] \cdot \mathbb{P}(\text{decision flips})

Four levers: budget at stake, share going to this channel, magnitude of the expected belief revision, and the probability that the revision crosses a budget-decision threshold.
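One way to fill in the two expectation terms is a Gaussian pre-posterior analysis: before the test runs, the posterior mean is itself random with variance $\sigma_k^2 - \sigma_{\text{post}}^2$. The sketch below uses that assumption. It is not necessarily how the calculator in 5.1 computes its readouts, so it should not be expected to reproduce them; the function names are illustrative.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def linearised_evoi(budget, share, sigma_prior, sigma_exp, mean_now, threshold):
    """B * s_k * E|belief shift| * P(decision flips), under a Gaussian pre-posterior.

    Assumption: the posterior mean after the test is distributed
    N(mean_now, sigma_prior^2 - sigma_post^2) when viewed from before the test.
    """
    var_post = 1.0 / (sigma_prior ** -2 + sigma_exp ** -2)
    shift_sd = math.sqrt(sigma_prior ** 2 - var_post)
    expected_abs_shift = shift_sd * math.sqrt(2.0 / math.pi)   # mean of a half-normal
    # Probability the updated mean lands on the other side of the threshold.
    p_flip = normal_cdf(-abs(mean_now - threshold) / shift_sd)
    return budget * share * expected_abs_shift * p_flip, expected_abs_shift, p_flip

evoi, shift, p_flip = linearised_evoi(
    budget=5_000_000, share=0.25, sigma_prior=0.40, sigma_exp=0.25,
    mean_now=1.20, threshold=1.00,
)
print(f"expected |shift| = {shift:.3f}, P(flip) = {p_flip:.1%}, EVOI = ${evoi:,.0f}")
```

The structure matters more than the exact numbers: shrinking the spend share or moving the current mean away from the threshold pulls EVOI down even when the information gain is unchanged.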

5.1 · EVOI calculator

Each slider moves one of the four levers. Watch how a high-EIG experiment on a tiny channel can have low EVOI (huge variance reduction, tiny stakes), while a moderate-EIG experiment on a 30%-share channel near the decision threshold can have $500K+ value per cycle.

Budget at stake $B (M$/wk) $5.0M

Channel spend share s_k 25%

Prior σ_k on ROI 0.40

Achievable σ_exp 0.25

Posterior mean (current) 1.20

Decision threshold 1.00

EVOI per cycle

$220K

P(decision flips)

22%

Expected belief shift

$0.12

Annualised value

$880K

EIG and EVOI together: the priority engine

EIG is necessary but not sufficient. A channel with huge prior σ but 1% spend share has enormous information gain and trivial decision value. The next section combines both into the priority score that picks each cycle's experimental portfolio.

§ 6 · Picking the portfolio

The Priority Matrix and the λ Tradeoff

Tune the weight between learning and decision value. Watch the recommended portfolio rearrange itself.

The composite priority score combines normalised EIG and EVOI with a tradeoff parameter $\lambda$:

\mathrm{Score}(k) \;=\; \lambda \cdot \widetilde{\mathrm{EIG}}(k) \;+\; (1-\lambda) \cdot \widetilde{\mathrm{EVOI}}(k)

$\lambda = 1$ is pure information maximisation (pick the channels we know least about, regardless of stakes). $\lambda = 0$ is pure decision-quality (only test where the answer would change the budget). Most healthy programs sit around $\lambda = 0.3$–$0.5$.
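A sketch of the score and the capped selection, assuming min-max scaling for the normalised terms (the text does not say which normalisation is used) and six hypothetical channels with made-up EIG and EVOI values:

```python
def min_max(values):
    """Rescale to [0, 1]; assumed normalisation for the tilde terms."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def priority_scores(eig, evoi, lam):
    """Score(k) = lam * EIG_norm(k) + (1 - lam) * EVOI_norm(k)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(min_max(eig), min_max(evoi))]

def pick(channels, eig, evoi, lam, cap):
    """Top `cap` channels by composite score."""
    scores = priority_scores(eig, evoi, lam)
    ranked = sorted(zip(channels, scores), key=lambda cs: cs[1], reverse=True)
    return [name for name, _ in ranked[:cap]]

# Hypothetical portfolio, not the workshop's table.
channels = ["search", "social", "ctv", "display", "audio", "affiliate"]
eig_bits = [0.4, 1.2, 1.6, 0.9, 1.9, 0.6]
evoi_k = [520, 310, 260, 120, 40, 90]        # $K per cycle

for lam in (0.0, 0.4, 1.0):
    print(f"lambda={lam:.1f} -> {pick(channels, eig_bits, evoi_k, lam, cap=3)}")
```

Sweeping `lam` from 0 to 1 shows the portfolio rotating from high-stakes channels toward poorly understood ones.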

6.1 · Six channels, one slider

A realistic six-channel portfolio with the EIG/EVOI characteristics noted in the table. Drag $\lambda$ and the operational cap. The 2×2 priority matrix updates live; the table re-ranks; selected experiments show a green badge.

λ (info ↔ value tradeoff) 0.40

Operational cap (max tests / cycle) 3

Channel	Spend	σ_k	σ_exp	EIG (bits)	EVOI ($K)	Score	Pick

Selected portfolio EIG

3.4 bits

Selected portfolio EVOI

$640K

Submodular efficiency

82% of max

§ 7 · Constraints & tradeoffs

Reality Bites: Costs, Risks, and Submodularity

Operational caps, geo conflicts, brand-safety risk. Picking a portfolio is a constrained optimisation, not a beauty contest.

Real selection adds two penalty terms to the priority score: an experimentation cost ($\gamma$) and an operational risk ($\rho$). Cost discourages large geo footprints; risk discourages tests that could create brand exposure or geo overlap with running campaigns.

\mathrm{Score}(k) \;=\; \lambda \cdot \widetilde{\mathrm{EIG}}(k) \;+\; (1-\lambda) \cdot \widetilde{\mathrm{EVOI}}(k) \;-\; \gamma \cdot \mathrm{cost}(k) \;-\; \rho \cdot \mathrm{risk}(k)

EIG of a set of experiments is monotone and submodular, so greedy selection provably captures at least $1 - 1/e \approx 63\%$ of the EIG of the best portfolio of the same size, and additional experiments past the third or fourth typically yield trivial marginal value. The bar chart below visualises that decay.
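A sketch of greedy selection on joint EIG. It assumes a jointly Gaussian prior over channel ROIs with independent experimental noise, for which the set EIG is $\tfrac{1}{2}\log_2\det(I + N^{-1/2}\Sigma N^{-1/2})$ over the selected channels. The prior covariance (a shared category-level component plus channel-specific spread) and all numbers are hypothetical.

```python
import math

def log2_det_spd(m):
    """log2(det(m)) of a symmetric positive-definite matrix via Cholesky."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log2(chol[i][i]) for i in range(n))

def set_eig_bits(selected, prior_cov, exp_var):
    """Joint EIG (bits) from testing every channel in `selected`."""
    if not selected:
        return 0.0
    m = [[(1.0 if a == b else 0.0)
          + prior_cov[a][b] / math.sqrt(exp_var[a] * exp_var[b])
          for b in selected] for a in selected]
    return 0.5 * log2_det_spd(m)

def greedy_portfolio(prior_cov, exp_var, cap):
    """Add the channel with the largest marginal EIG until the cap is reached."""
    chosen, gains = [], []
    while len(chosen) < min(cap, len(exp_var)):
        base = set_eig_bits(chosen, prior_cov, exp_var)
        remaining = [k for k in range(len(exp_var)) if k not in chosen]
        best = max(remaining,
                   key=lambda k: set_eig_bits(chosen + [k], prior_cov, exp_var))
        gains.append(set_eig_bits(chosen + [best], prior_cov, exp_var) - base)
        chosen.append(best)
    return chosen, gains

# Hypothetical prior: channel-specific sd plus a shared component (sd 0.30)
# that makes a test on one channel partly informative about its siblings.
own_sd, shared_sd = [0.40, 0.30, 0.50, 0.20], 0.30
prior_cov = [[shared_sd ** 2 + (own_sd[a] ** 2 if a == b else 0.0)
              for b in range(4)] for a in range(4)]
exp_var = [0.25 ** 2, 0.20 ** 2, 0.30 ** 2, 0.25 ** 2]

chosen, gains = greedy_portfolio(prior_cov, exp_var, cap=4)
for step, (k, g) in enumerate(zip(chosen, gains), start=1):
    print(f"pick {step}: channel {k}, marginal EIG = {g:.3f} bits")
```

Because the shared component is learned partly from the first test, each later pick contributes less, which is the decay the marginal-EIG bars display.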

7.1 · Constraint penalties and submodular returns

Same six channels as § 6, now with cost and risk penalties applied. The marginal-EIG bars show what each additional selected experiment contributes; the cumulative line is the running portfolio EIG. Past the third experiment, marginal returns drop sharply.

Cost penalty γ 0.20

Risk penalty ρ 0.15

Test budget cap ($K) $300K

Picked under constraints

3 / 6

Test cost ($K)

$240K

Captured EVOI

$580K

Net value (EVOI − cost)

$340K

Operational risk is not the same as model risk

A test that overlaps with a brand-sensitive launch, or that competes for geos already running a separate experiment, can have ROI-positive math but enterprise-negative consequences. The risk penalty $\rho$ exists to make this explicit. Don't run experiments that win on a spreadsheet but lose in the executive review.

§ 8 · The compounding loop

Multi-Cycle Simulator

A single experiment is a one-shot gain. The loop is a productivity curve. Run it.

Each quarterly cycle re-fits the MMM, recomputes priorities, runs 2-3 experiments, calibrates, re-allocates, and re-scores. Calibrated channels see their EIG and EVOI drop — previously lower-priority channels rise. The simulator below traces the four headline metrics across N cycles.
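The rotation effect can be seen in a stripped-down sketch of the loop. It ranks channels by EIG alone (that is, $\lambda = 1$), uses the same hypothetical experiment precision for every channel, and omits the re-fit and re-allocation steps, so it illustrates only the "tested channels drop, untested channels rise" dynamic.

```python
import math

def run_cycles(sigmas, sigma_exp, tests_per_cycle, n_cycles):
    """Each cycle: rank channels by EIG, test the top few, apply the conjugate update."""
    sigmas = list(sigmas)
    history = []
    for cycle in range(1, n_cycles + 1):
        eig = [0.5 * math.log2(1.0 + (s / sigma_exp) ** 2) for s in sigmas]
        picks = sorted(range(len(sigmas)), key=lambda k: eig[k],
                       reverse=True)[:tests_per_cycle]
        for k in picks:
            # Precisions add, so a tested channel's sigma shrinks and its next-cycle EIG drops.
            sigmas[k] = (sigmas[k] ** -2 + sigma_exp ** -2) ** -0.5
        history.append((cycle, picks, [round(s, 3) for s in sigmas]))
    return history

for cycle, picks, sigmas in run_cycles([0.45] * 5, sigma_exp=0.25,
                                       tests_per_cycle=2, n_cycles=4):
    print(f"cycle {cycle}: tested {picks}, sigmas {sigmas}")
```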

8.1 · Run the loop

Five channels, configurable starting MMM uncertainty, calibration mechanism, and cycle count. The four panels show: posterior CI widths contracting on tested channels, allocation share migrating to true high-ROI channels, weekly misallocation cost falling, and portfolio marginal ROI rising. Dashed traces show the no-experiments counterfactual.

Cycles to run 6

Starting σ₀ (cross-channel) 0.45

Tests per cycle 2

Calibration mechanism

Final misallocation

$15K / wk

Misallocation reduction

−92%

Final decision efficiency

97%

Calibration coverage

68%

Simulated — these numbers come from the toy five-channel world generated in your browser, not from client data or measured results.

What the four panels are actually showing

Top-left = epistemic outcome (what we know better). Top-right = operational outcome (what we do differently as a result). Bottom-left = economic outcome (what poor knowledge was costing us). Bottom-right = strategic outcome (the bottom-line lift in marginal productivity). A defensible measurement program reports all four to the sponsor every cycle.

§ 9 · Where the program should live

The Decision-Quality Frontier

The right number of experiments per cycle is whatever sits on the elbow of the cost-value curve.

Total experiment cost rises linearly with the number of tests in a cycle; captured EVOI rises concavely thanks to submodularity. The net value $V(n) = \mathrm{EVOI}(n) - c \cdot n$ has a maximum, almost always between two and four tests for a five-channel portfolio.
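A sketch of locating the optimum, assuming the marginal EVOI of each additional test decays geometrically (one simple concave shape; the actual curve depends on the portfolio). The first-test value, decay rate, and per-test cost are hypothetical.

```python
def captured_evoi(n, first_value, decay):
    """Cumulative EVOI of n tests when each test is worth `decay` times the previous one."""
    return first_value * (1.0 - decay ** n) / (1.0 - decay)

def best_test_count(first_value, decay, cost_per_test, max_tests):
    """Maximise V(n) = EVOI(n) - c * n over whole numbers of tests."""
    net = {n: captured_evoi(n, first_value, decay) - cost_per_test * n
           for n in range(max_tests + 1)}
    return max(net, key=net.get), net

n_star, net = best_test_count(first_value=320.0, decay=0.55,
                              cost_per_test=80.0, max_tests=8)
for n, v in net.items():
    marker = "  <- optimum" if n == n_star else ""
    print(f"n={n}: net value = ${v:7.1f}K{marker}")
```

The optimum sits where the marginal EVOI of the next test first falls below its cost.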

9.1 · Cost-value tradeoff

Drag the per-test cost and the channel count. The curve shows captured EVOI minus total cost as a function of how many experiments we run that cycle. The marker tracks the optimum.

Cost per experiment ($K) $80K

Channel count K 8

Avg per-channel EVOI ($K) $320K

Optimal # experiments

3

Net value at optimum

$540K

Marginal cost ≅ benefit at

3rd test

Why the curve always has an elbow

The number of useful experiments per cycle is bounded above by submodularity, not by management appetite. Even with infinite budget and zero ops constraints, the marginal information from the fourth or fifth experiment is usually one-tenth of the first. Plan around the elbow; do not let "we should test more" smuggle in tests with negative net value.

§ 10 · Geo experiment design

From Fitted MMM to Field Experiment

A geo-level MMM is not just a measurement artifact — it is an experiment design engine. The posterior tells you which geos to pick, how much budget to shift, and how long to run.

Once an MMM is fitted with geo-level variation, every posterior sample doubles as a forward simulation: what would the KPI do in geo $g$ if spend rose by $\Delta b$? The standard measurement playbook defaults to picking "similar-looking" markets by eye, applying a round percentage lift, and running for six to eight weeks. Each of those defaults (market selection, treatment intensity, duration) is a design choice that meaningfully affects information yield, and a geo-level Bayesian MMM gives you better answers for all of them.

What the MMM Provides for Experiment Design

A hierarchical geo-level MMM produces a richer set of outputs than the channel-level marginals used for channel selection. The quantities relevant to test design are:

Quantity	Symbol	Use in test design
Per-geo channel posterior	$\beta_{k,g} \mid \mu_{k,g},\,\sigma_{k,g}^2$	Identifies geos with most uncertainty (treated-geo ranking)
Per-geo current spend	$x_{k,g}$	Anchors the saturation-curve evaluation point
Per-geo saturation curve	$S_k(\cdot;\,h_{k,g},\kappa_{k,g})$	Local slope determines optimal $\Delta_{\text{spend}}$
Per-geo residual variance	$\sigma_g^2$	Plugs directly into the $\sigma_{\text{exp}}$ formula
Within-geo serial correlation	$\rho_g$	Determines how duration converts to precision
Geo random-effect posteriors	$\mathbf{u}_g$	Mahalanobis distance for matched-pair construction
Cross-geo posterior correlations	$\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})$	Spillover and contamination diagnostics

Step 1 · Treated-Geo Selection by Information Yield

The ideal (treatment, control) geo pair satisfies three criteria: high pre-period correlation on the KPI (well-specified counterfactual), comparable absolute scale (neither side dominates), and low cross-geo spillover (independent markets). A simple balance score captures the first two:

\text{Balance}(T,\,C) \;=\; \frac{\rho\!\left(y_T^{\text{pre}},\;\hat{y}_C^{\text{pre}}\right)}{1 + \left|\log\!\left(\bar{y}_T / \bar{y}_C\right)\right|}

But correlation on raw KPI is a noisy proxy. The richer signal from the geo-level MMM enables a proper information yield score that combines uncertainty, saturation slope, and residual noise into a single ranking criterion:

I_{k,g} \;=\; \frac{\sigma_{k,g}^2 \cdot \bigl[S_k'(x_{k,g})\bigr]^2}{\sigma_g^2}

The numerator captures the prior variance scaled by the responsiveness of the outcome to spend at the current operating point; the denominator penalizes geos with noisy outcomes. Apply hard eligibility filters (minimum population, no recent anomalies) first, as a gate before ranking rather than as tiebreakers afterwards. Then rank the remaining candidate geos by $I_{k,g}$ and take the top $G_T$ (typically 8–24 markets).
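A sketch of the filter-then-rank procedure. It assumes a Hill-type saturation curve $S(x) = x^h / (x^h + \kappa^h)$, which is one common choice consistent with the $(h, \kappa)$ parameters in the table but is not confirmed by the text. The geo records and thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Geo:
    name: str
    sigma_beta: float      # posterior sd of the channel effect, sigma_{k,g}
    spend: float           # current spend x_{k,g}
    hill_h: float          # assumed Hill shape parameter
    hill_kappa: float      # assumed Hill half-saturation point
    resid_sd: float        # residual sd, sigma_g
    population: int
    recent_anomaly: bool

def hill_slope(x, h, kappa):
    """Derivative of S(x) = x^h / (x^h + kappa^h) with respect to x."""
    return h * kappa ** h * x ** (h - 1) / (x ** h + kappa ** h) ** 2

def information_yield(g):
    """I_{k,g} = sigma_{k,g}^2 * S'(x_{k,g})^2 / sigma_g^2."""
    slope = hill_slope(g.spend, g.hill_h, g.hill_kappa)
    return g.sigma_beta ** 2 * slope ** 2 / g.resid_sd ** 2

def rank_treated_geos(geos, top_n, min_population):
    """Gate on eligibility first, then rank the survivors by information yield."""
    eligible = [g for g in geos
                if g.population >= min_population and not g.recent_anomaly]
    return sorted(eligible, key=information_yield, reverse=True)[:top_n]

geos = [  # hypothetical markets
    Geo("geo_a", 0.40, 50.0, 1.5, 100.0, 0.20, 2_000_000, False),
    Geo("geo_b", 0.40, 250.0, 1.5, 100.0, 0.20, 3_000_000, False),  # near saturation
    Geo("geo_c", 0.60, 60.0, 1.5, 100.0, 0.15, 300_000, False),     # too small
    Geo("geo_d", 0.30, 80.0, 1.5, 100.0, 0.25, 1_500_000, True),    # recent anomaly
]
for g in rank_treated_geos(geos, top_n=2, min_population=500_000):
    print(f"{g.name}: I = {information_yield(g):.2e}")
```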

Step 2 · Matched-Pair Construction via Posterior Mahalanobis Distance

For each treated geo, the control pool must be genuinely similar in the dimensions that matter for channel $k$'s effect. A hierarchical MMM gives you a far better metric than correlation matching: each geo has a posterior over its random-effect vector $\mathbf{u}_g$ (channel-coefficient deviations, baseline level, seasonality). Genuinely matched markets have similar $\mathbf{u}_g$ posteriors. The natural distance is Mahalanobis with the joint posterior covariance:

d^2(g,\,g') \;=\; (\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})^\top\,\boldsymbol{\Sigma}_u^{-1}\,(\boldsymbol{\mu}_{u,g} - \boldsymbol{\mu}_{u,g'})

Markets close in this metric will respond similarly to channel-$k$ perturbations even if their raw revenue trajectories differ in level or seasonality. The heuristic expectation — not a measured benchmark — is that Mahalanobis matching on the MMM posterior should cut post-period control-group variance meaningfully versus correlation matching on the same pre-period, because it matches on the dimensions that actually drive the channel's effect. Validate the gain on your own data before counting on it in a power calculation.

Hierarchical pooling as a matching tool

Geos sharing a regional hyperparameter cluster together in posterior space. The posterior covariance over geo-level random effects is therefore already a similarity matrix, so no separate matching model is required. For each treated geo, take the $M$ closest control geos ($M \in \{1,2,3\}$ is typical) and use them as the donor pool.
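A sketch of the donor-pool construction, with the distance computed by solving a linear system rather than inverting $\boldsymbol{\Sigma}_u$ explicitly. The geo names, posterior means, and covariance are hypothetical two-dimensional stand-ins for the full random-effect vector.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def mahalanobis_sq(mu_a, mu_b, cov):
    """d^2 = (mu_a - mu_b)^T cov^{-1} (mu_a - mu_b)."""
    diff = [p - q for p, q in zip(mu_a, mu_b)]
    return sum(d * s for d, s in zip(diff, solve(cov, diff)))

def nearest_controls(treated, candidates, means, cov, n_controls):
    """The n_controls candidates closest to the treated geo in posterior space."""
    ranked = sorted(candidates,
                    key=lambda g: mahalanobis_sq(means[treated], means[g], cov))
    return ranked[:n_controls]

# Hypothetical posterior means of (channel-coefficient deviation, baseline deviation).
means = {"treated": [0.30, -0.10], "c1": [0.28, -0.05],
         "c2": [-0.20, 0.40], "c3": [0.35, -0.30]}
cov = [[0.04, 0.01], [0.01, 0.09]]
print(nearest_controls("treated", ["c1", "c2", "c3"], means, cov, n_controls=2))
```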

Step 3 · Treatment Intensity from the Saturation Curve

The signal of the experiment is the KPI lift attributable to the budget increment in treatment geos. Standard practice sets the uplift as a round percentage (e.g., "add 30%") across all geos. But the marginal return to spend is different at each geo's current operating point on the saturation curve, so the same percentage uplift produces very different signals.

The MMM gives you the local slope of the saturation curve at each geo's observed spend:

S_k'(x_{k,g}) \;=\; \frac{\partial}{\partial x} S_k(x;\,h_{k,g},\kappa_{k,g})\bigg|_{x=x_{k,g}}

A geo already near saturation has a small slope — a large spend increment is needed to generate a detectable lift. A geo well below saturation has a steep slope — a modest increment suffices. Equating the target signal-to-noise ratio across geos gives the saturation-slope-aware optimal spend increment:

\Delta\mathrm{spend}_g^* \;=\; \frac{\sigma_g}{S_k'(x_{k,g})} \cdot \sqrt{\frac{2\,D(T,\rho_g)}{GT} \cdot \frac{\eta}{1-\eta}} \cdot \frac{1}{\sigma_{k,g}}

Here $\sigma_g$ is the geo's residual standard deviation, $\sigma_{k,g}$ the posterior standard deviation of the geo's channel effect, $\rho_g$ the within-geo serial correlation, $G$ the number of treatment geos, $T$ the planned duration, $\eta$ the target power, and $D(T,\rho_g)$ the AR(1) design effect for a $T$-week mean, defined in Step 4 below. In practice $\Delta\mathrm{spend}_g^*$ varies 3–5× across geos for the same channel, so treating every geo identically leaves information on the table.

The effective signal the experiment must detect is:

\Delta y \;=\; S_k'(x_{k,g})\cdot\Delta\mathrm{spend}_g^*, \qquad \theta \;\sim\; p(\theta \mid \mathcal{D})

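A wide ROI posterior $\sigma_\theta$ sets a hard ceiling on achievable power regardless of budget: to first order (the delta method), $\mathrm{Var}(\Delta y) = \Delta b^2\,\sigma_\theta^2$ for a spend increment $\Delta b$, so the expected lift and its posterior spread both scale with $\Delta b$ and a larger increment cannot improve their ratio. Check that ceiling before committing to the design (see the interactive demo below).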

Step 4 · Duration from Serial Correlation

Under a difference-in-differences estimator with $n_t$ treatment geos and $n_c$ control geos observed for $T$ weeks, the naive standard error shrinks as $1/\sqrt{T}$. Weekly KPI data within a geo are serially correlated, however: successive weeks share common demand shocks, promotions, and seasonal patterns. The MMM residuals estimate the within-geo AR(1) coefficient $\rho_g$. Under AR(1), the variance of a $T$-week mean is inflated by the design effect

D(T,\rho_g) \;=\; 1 + 2\sum_{k=1}^{T-1}\Bigl(1-\tfrac{k}{T}\Bigr)\rho_g^{\,k} \;\;\xrightarrow[T\to\infty]{}\;\; \frac{1+\rho_g}{1-\rho_g},

and the corrected standard error is:

\mathrm{SE}(\hat\tau) \;=\; \sigma_y\,\sqrt{\tfrac{1}{n_t} + \tfrac{1}{n_c}} \cdot \sqrt{\frac{D(T,\rho_g)}{T}}

(Do not use the exchangeable design effect $1+(T-1)\rho$ here — that formula is for equicorrelated cluster data; under AR(1) the correlation decays with lag, and the exchangeable factor badly overstates required duration. At $\rho_g=0.4$, $T=8$: exchangeable $3.8$ vs AR(1) $\approx 2.1$.)
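The two design effects are easy to compare numerically; this sketch simply evaluates both formulas (function names are illustrative):

```python
def ar1_design_effect(weeks, rho):
    """D(T, rho) = 1 + 2 * sum_{k=1}^{T-1} (1 - k/T) * rho^k for a T-week mean."""
    return 1.0 + 2.0 * sum((1.0 - k / weeks) * rho ** k for k in range(1, weeks))

def exchangeable_design_effect(weeks, rho):
    """1 + (T - 1) * rho: valid for equicorrelated data, not for AR(1)."""
    return 1.0 + (weeks - 1) * rho

rho = 0.4
for weeks in (4, 8, 16, 52):
    print(f"T={weeks:>2}: AR(1) D = {ar1_design_effect(weeks, rho):.2f}, "
          f"exchangeable = {exchangeable_design_effect(weeks, rho):.1f}")
print(f"AR(1) limit (1+rho)/(1-rho) = {(1 + rho) / (1 - rho):.2f}")
```

The AR(1) factor levels off near its limit while the exchangeable factor keeps growing linearly in $T$, which is why the latter overstates the required duration.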

When $\rho_g = 0$, $D(T,\rho_g) = 1$ and this reduces to the familiar $1/\sqrt{T}$ formula. When $\rho_g > 0$, the effective number of independent weeks, $T / D(T,\rho_g)$, is smaller than $T$, so each additional week adds less precision than it would under independence. Working that formula through for $\rho_g \in [0.2, 0.5]$ (plausible values for weekly retail or digital data), Expected Information Gain (EIG) flattens after roughly 6–10 weeks; running longer costs budget without proportionally improving power. Treat that as a heuristic expectation from the serial-correlation math, not a measured benchmark. (Note: the interactive calculator below uses the independent-weeks approximation and does not apply $D(T,\rho_g)$; for autocorrelated KPIs, inflate its required-weeks output by roughly the Step-4 design effect.)

Statistical power at duration $T$ is:

\mathrm{Power}(T) \;=\; \Phi\!\left(\frac{\Delta y}{\mathrm{SE}(\hat\tau)} - z_{1-\alpha/2}\right)

Rearranging gives the minimum required duration for target power $1-\beta$:

T^* \;=\; \frac{(z_{1-\alpha/2} + z_{1-\beta})^2\;\sigma_y^2\;(1/n_t + 1/n_c)\,D(T^*,\rho_g)}{\Delta y^2}

Because $T^*$ appears on both sides, solve iteratively (or use the calculator below). The dashed curve in the chart adds the MMM's own posterior spread on $\theta$: even with perfect execution, $\mathrm{Var}(\Delta y) = \Delta b^2\,\sigma_\theta^2$ is independent of test duration and sets a hard power ceiling. Check that ceiling before committing.
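Since $T^*$ is a whole number of weeks, one simple way to handle the implicit equation is to scan $T = 1, 2, \ldots$ and stop at the first week where power reaches the target. The sketch below does that with the AR(1)-corrected standard error; it ignores the posterior-uncertainty ceiling, and the example inputs (a 10% expected lift against a 0.20 weekly CV) are hypothetical.

```python
import math
from statistics import NormalDist

def ar1_design_effect(weeks, rho):
    return 1.0 + 2.0 * sum((1.0 - k / weeks) * rho ** k for k in range(1, weeks))

def did_standard_error(sigma_y, n_t, n_c, weeks, rho):
    """SE = sigma_y * sqrt(1/n_t + 1/n_c) * sqrt(D(T, rho) / T)."""
    return (sigma_y * math.sqrt(1.0 / n_t + 1.0 / n_c)
            * math.sqrt(ar1_design_effect(weeks, rho) / weeks))

def power_at(delta_y, sigma_y, n_t, n_c, weeks, rho, alpha=0.05):
    """Power(T) = Phi(delta_y / SE - z_{1 - alpha/2})."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = did_standard_error(sigma_y, n_t, n_c, weeks, rho)
    return NormalDist().cdf(delta_y / se - z_crit)

def required_weeks(delta_y, sigma_y, n_t, n_c, rho,
                   alpha=0.05, target_power=0.80, max_weeks=104):
    """Smallest whole T whose power reaches the target, or None within max_weeks."""
    for weeks in range(1, max_weeks + 1):
        if power_at(delta_y, sigma_y, n_t, n_c, weeks, rho, alpha) >= target_power:
            return weeks
    return None

for rho in (0.0, 0.2, 0.4):
    t_star = required_weeks(delta_y=0.10, sigma_y=0.20, n_t=5, n_c=10, rho=rho)
    print(f"rho={rho:.1f}: required weeks = {t_star}")
```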

10.1 · Geo experiment power calculator

Drag the sliders to configure the experiment design. The solid curve shows power under the MMM's point-estimate ROI. The dashed curve folds in posterior ROI uncertainty — it plateaus once the MMM's own spread dominates the residual noise. The vertical marker shows where your target power is first reached.

Treatment geos 5

Control geos 10

Budget uplift in treatment (%) 30%

Baseline weekly CV per geo 0.20

Channel ROI — MMM posterior mean 1.50

ROI posterior σ from MMM 0.30

Target power 80%

Required weeks

—

MDE at 8 weeks

—

Expected lift per geo

—

Power ceiling (ROI uncertainty)

—

The MMM prior as a power-planning shortcut

A tight posterior on $\theta$ (small $\sigma_\theta$) means the model is already confident about channel effectiveness — the experiment only needs to confirm it, so a shorter or smaller design is still informative. A wide posterior signals that the experiment must be large enough to move the MMM needle, not just reach nominal significance. Check whether the power ceiling (plateau of the dashed curve) is above your target before committing to the design.

Step 5 · Pre-Experiment Simulation

Before committing real budget, run the proposed design through the fitted MMM as a simulator. This catches power shortfalls and degenerate designs cheaply:

1. Sample a draw $\tilde\theta$ from the MMM channel posterior $p(\theta \mid \mathcal{D})$.
2. Simulate KPI for each geo under the proposed spend schedule using $\tilde\theta$ plus AR(1) noise with innovation variance $\sigma_g^2$ and coefficient $\rho_g$; simulate the autocorrelated process itself rather than scaling i.i.d. noise by a design factor.
3. Estimate treatment effect from the simulated data via the same difference-in-differences estimator you plan to use on real data.
4. Repeat 500–2,000 times and record the fraction of replicates where the test reaches significance at your chosen $\alpha$.

If simulated power falls short of target, increase $\Delta\mathrm{spend}_g^*$, add geos, or extend duration — then re-simulate. This loop typically takes minutes and surfaces issues (skewed KPI distributions, heterogeneous geo variance) that the closed-form formulas miss.
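The simulation loop can be sketched in a few lines. Several simplifications here are assumptions rather than part of the procedure above: the ROI posterior is treated as Gaussian instead of using real MMM draws, every geo shares one residual sd and AR(1) coefficient, the lift is a constant $\tilde\theta \cdot \Delta\text{spend}$ added to post-period weeks, and significance uses a normal critical value on geo-level post-minus-pre differences (slightly anti-conservative with few geos).

```python
import math
import random
from statistics import mean, variance

def simulate_geo(rng, weeks_pre, weeks_post, lift, resid_sd, rho):
    """One geo's KPI deviations: AR(1) noise, plus `lift` in post-period weeks."""
    # Start from the stationary distribution so early weeks are not artificially quiet.
    e = rng.gauss(0.0, resid_sd / math.sqrt(1.0 - rho ** 2))
    series = []
    for t in range(weeks_pre + weeks_post):
        e = rho * e + rng.gauss(0.0, resid_sd)      # innovation sd = resid_sd
        series.append(e + (lift if t >= weeks_pre else 0.0))
    return series

def did_reject(rng, n_t, n_c, weeks_pre, weeks_post, lift, resid_sd, rho, z_crit=1.96):
    """Run one simulated test; True if the difference-in-differences is significant."""
    def post_minus_pre(geo_lift):
        y = simulate_geo(rng, weeks_pre, weeks_post, geo_lift, resid_sd, rho)
        return mean(y[weeks_pre:]) - mean(y[:weeks_pre])
    d_t = [post_minus_pre(lift) for _ in range(n_t)]
    d_c = [post_minus_pre(0.0) for _ in range(n_c)]
    tau_hat = mean(d_t) - mean(d_c)
    se = math.sqrt(variance(d_t) / n_t + variance(d_c) / n_c)
    return abs(tau_hat / se) > z_crit

def simulated_power(roi_mean, roi_sd, delta_spend, n_reps=500, seed=1, **design):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        roi_draw = rng.gauss(roi_mean, roi_sd)      # step 1: draw theta from the posterior
        hits += did_reject(rng, lift=roi_draw * delta_spend, **design)  # steps 2-3
    return hits / n_reps                            # step 4: share of significant replicates

design = dict(n_t=5, n_c=10, weeks_pre=8, weeks_post=8, resid_sd=0.20, rho=0.4)
print(f"simulated power: {simulated_power(1.5, 0.3, delta_spend=0.10, **design):.2f}")
print(f"false-positive rate: {simulated_power(1.5, 0.3, delta_spend=0.0, **design):.2f}")
```

Running the zero-lift case alongside the planned design is a cheap check that the estimator's false-positive rate is close to the nominal $\alpha$ before trusting the power number.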

Step 6 · Spillover and Contamination Diagnostics

Two geos whose channel posteriors move together are likely contaminated: a spend change in the treatment geo affects the control geo (e.g., national TV spillover, retargeting bleed across DMAs). The MMM captures this as cross-geo posterior correlation:

r_{g,g'} \;=\; \mathrm{Cor}\!\bigl(\beta_{k,g},\,\beta_{k,g'}\bigr)

Flag any control candidate with $|r_{g,g'}| > 0.5$ and exclude it from the donor pool. If all nearby geos are contaminated, consider a holdout design where a cluster of adjacent geos is jointly treated and a geographically distant cluster serves as control. Adstock carryover introduces an analogous time-domain contamination: include a 1–2 week wash-in buffer before measuring lift, or model carryover explicitly in the geo-level analysis.
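Given posterior draws of $\beta_{k,g}$ for each geo, the flag is a correlation and a threshold. In this sketch the draws are synthetic (an "adjacent" geo shares a common component with the treated geo, a "distant" one does not) and the geo names are hypothetical; with a fitted model the draws would come from the MMM's posterior samples.

```python
import math
import random

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def contaminated_controls(treated_draws, control_draws, threshold=0.5):
    """Names of control candidates whose |posterior correlation| exceeds the threshold."""
    return sorted(name for name, draws in control_draws.items()
                  if abs(pearson(treated_draws, draws)) > threshold)

# Synthetic stand-ins for posterior draws of beta_{k,g}.
rng = random.Random(7)
shared = [rng.gauss(0.0, 1.0) for _ in range(2000)]
treated = [1.2 + 0.2 * s + 0.1 * rng.gauss(0.0, 1.0) for s in shared]
controls = {
    "adjacent_geo": [1.0 + 0.2 * s + 0.1 * rng.gauss(0.0, 1.0) for s in shared],
    "distant_geo": [0.9 + 0.22 * rng.gauss(0.0, 1.0) for _ in shared],
}
for name, draws in controls.items():
    print(f"{name}: r = {pearson(treated, draws):+.2f}")
print("excluded from donor pool:", contaminated_controls(treated, controls))
```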

Worked Example — CTV Campaign Design

The table below illustrates Steps 1–6 applied to a Connected-TV channel for a national retailer. Five geos were scored; the top two by information yield became treatment markets, with matched controls chosen by Mahalanobis distance.

Geo	Role	$I_{k,g}$ rank	Saturation regime	$\Delta\mathrm{spend}_g^*$	Required weeks $T^*$
Atlanta	Treatment	1st	Sub-saturated ($S'$ = 0.82)	+18%	7
Phoenix	Treatment	2nd	Mid-saturated ($S'$ = 0.51)	+31%	9
Tampa	Control (Atlanta)	—	$d^2 = 0.9$ (closest match)	0%	—
Portland	Control (Phoenix)	—	$d^2 = 1.4$ (closest match)	0%	—
St. Louis	Excluded	3rd	$r_{g,g'} = 0.71$ (spillover)	—	—

Atlanta's lower $\Delta\mathrm{spend}^*$ (18% vs 31%) reflects its steeper saturation slope: the same ROI uncertainty is resolved more cheaply where the curve is still responsive. St. Louis was excluded despite its high information yield because its cross-geo posterior correlation with Atlanta ($r = 0.71$) exceeded the contamination threshold.

Extending to (channel × geo) pairs

The same greedy submodular selection logic from §7 applies when you have multiple channels competing for experiment budget. Score every (channel, geo) pair by $I_{k,g}$, run the Mahalanobis matching and the spillover exclusion for each candidate, then greedily select pairs up to the experiment budget. The greedy sequence achieves at least $(1-1/e) \approx 63\%$ of the optimal joint information gain for a selection of the same size, and the MMM gives you the full joint posterior needed to compute it.

Closing the Calibration Loop

After the geo test concludes, the estimated lift $\hat\tau$ with standard error $\hat\sigma_e = \mathrm{SE}(\hat\tau)$ feeds directly into the Bayesian update from §3. The MMM's channel prior $\mathcal{N}(\mu_\theta, \sigma_\theta^2)$ updates by precision-weighted averaging:

\sigma_{\text{post}}^{-2} \;=\; \sigma_\theta^{-2} + \hat\sigma_e^{-2}, \qquad \mathrm{EIG} \;=\; \tfrac{1}{2}\log_2\!\left(1 + \frac{\sigma_\theta^2}{\hat\sigma_e^2}\right) \quad \text{bits}

Longer tests and more geos shrink $\hat\sigma_e$, increasing EIG. But diminishing returns apply here too: EIG grows only logarithmically in the variance ratio, so once $\hat\sigma_e \ll \sigma_\theta$ each additional week buys very little extra information. The design sweet spot is where the experiment's precision is comparable to (not far tighter than) the MMM's prior uncertainty.

Adstock carryover, spillover, and the contamination threshold

Adstock carryover means the first few weeks of a geo test carry contamination from pre-test spend levels — include a 1–2 week wash-in buffer or model carryover explicitly. Geo spillover (retargeting bleed, national TV halo) deflates estimated lift when treatment and control geos are adjacent; flag any control candidate with $|r_{g,g'}| = |\mathrm{Cor}(\beta_{k,g}, \beta_{k,g'})| > 0.5$ and exclude it from the donor pool. The MMM's spatial posterior correlations give you this diagnostic for free — no additional data collection required.

§ 11 · Takeaways

What This Framework Buys You

For the modeling team

EIG, EVOI, and the priority score replace gut-feel "what should we test next?" with a score you can defend in a deck. The Bayesian update gives you a principled way to fold results back in — no analyst-led adjustments, no opaque overrides.

For the planning team

Each cycle produces three artifacts: the priority map (what's being tested and why), the calibrated allocation (with confidence tiers per channel), and the trajectory chart (compounding evidence of program value). Three pages. One meeting. Every quarter.

For the CFO / sponsor

Misallocation cost is reported in dollars per week. Portfolio mROI is reported as a running multiple. Calibration coverage is reported as a fraction of spend. The program's payback is on the deck, not buried in a model spec.

For the program over time

Submodularity caps useful experimentation at 2-4 tests per cycle. Information decay triggers re-experimentation on a cadence the data sets, not the calendar. The loop is self-throttling and self-correcting.

Where to go from here

For the conceptual framework and the math derivations, see the Closed-Loop Measurement & Calibration guide. For practical guidance on interpreting calibrated outputs in budget meetings, see Interpreting Results. For the full glossary of terms, see the Glossary.