From experimental evidence to defensible budget allocation — the full loop, with the math working live.
This workshop walks the full decision-theoretic stack behind the closed-loop calibration framework. Every section has a working calculator. Drag the sliders and watch what changes — the formulas on the page are the same ones running in your browser. By the end you'll have a concrete sense of why measurement programs need to be designed around decisions, not data, and how to balance value, cost, and operational risk to actually pick what to test next quarter.
An MMM is not the deliverable. The budget allocation is. Frame the math accordingly.
A decision problem in classical decision theory has four parts. Every measurement question eventually has to answer all of them, even if implicitly:
State $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$ is the unknown channel-level ROI vector. Action $\mathbf{b} = (b_1, \ldots, b_K)$ is the budget allocation we choose, subject to $\sum_k b_k = B$.
Utility $U(\mathbf{b}, \boldsymbol{\theta}) = \sum_k \theta_k\, f(b_k)$ with concave response $f(\cdot)$. Constraints include channel floors, ops capacity for tests, calendar windows, and creative readiness.
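The choice we actually want to make is the budget that maximises expected utility under our current beliefs. With perfect knowledge that's $\mathbf{b}^*(\boldsymbol{\theta}) = \arg\max_{\mathbf{b}} U(\mathbf{b}, \boldsymbol{\theta})$. With uncertainty it's $\mathbf{b}^*(\bar{\boldsymbol{\theta}})$, where $\bar{\boldsymbol{\theta}}$ is the posterior mean; plugging in the mean is exact here because $U$ is linear in $\boldsymbol{\theta}$, so $\mathbb{E}[U(\mathbf{b}, \boldsymbol{\theta})] = U(\mathbf{b}, \bar{\boldsymbol{\theta}})$. The value of that decision still degrades with uncertainty in ways most teams don't quantify.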
Even with the same expected ROI, two strategies can be evaluated very differently depending on risk preference. CRRA (constant relative risk aversion) utility $u(x) = x^{1-\rho}/(1-\rho)$ captures this with one parameter. Drag $\rho$ to see the gap between expected value and certainty equivalent open up.
A wider credible interval is not just "fuzzy precision" — under any concave utility it is a real, measurable cost. The certainty equivalent of a noisy ROI estimate is strictly less than its mean. That gap is what an experimental program is buying down.
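The gap between mean and certainty equivalent can be computed directly. The sketch below is a minimal illustration, assuming a hypothetical two-outcome ROI lottery (0.8 or 1.6 with equal probability); the function names are illustrative and not part of the framework.

```python
import math

def crra_utility(x: float, rho: float) -> float:
    """CRRA utility; rho = 1 is handled as the log-utility limit."""
    if x <= 0:
        raise ValueError("CRRA utility needs a positive payoff")
    if abs(rho - 1.0) < 1e-12:
        return math.log(x)
    return x ** (1.0 - rho) / (1.0 - rho)

def crra_inverse(u: float, rho: float) -> float:
    """Payoff whose CRRA utility equals u."""
    if abs(rho - 1.0) < 1e-12:
        return math.exp(u)
    return ((1.0 - rho) * u) ** (1.0 / (1.0 - rho))

def certainty_equivalent(outcomes, probs, rho):
    """Sure payoff with the same expected utility as the lottery."""
    expected_u = sum(p * crra_utility(x, rho) for x, p in zip(outcomes, probs))
    return crra_inverse(expected_u, rho)

# Hypothetical ROI lottery (an assumption for illustration only).
outcomes, probs = [0.8, 1.6], [0.5, 0.5]
mean_roi = sum(p * x for x, p in zip(outcomes, probs))
for rho in (0.0, 1.0, 3.0):
    ce = certainty_equivalent(outcomes, probs, rho)
    print(f"rho={rho}: mean={mean_roi:.3f}, CE={ce:.3f}, gap={mean_roi - ce:.3f}")
```

At $\rho = 0$ the decision-maker is risk-neutral and the gap vanishes; the gap widens as $\rho$ grows.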
A confidently-wrong MMM can be worse than no model at all — you pay to build it, then commit the wrong dollars before anyone notices.
Frame misallocation cost as the gap between the value we'd produce knowing $\theta$ exactly and the value we actually produce optimising against our point estimate $\hat\theta$. Under a concave response curve and a budget constraint, that gap grows with posterior variance even when the posterior mean is correct on average.
A two-channel portfolio with budget $B = \$1\text{M}$. True ROI is identical for both channels (so the optimal split is 50/50). The model's point estimate says one channel is better — by an amount drawn from the posterior. The wider that posterior, the further we drift from optimum on average. Drag the σ slider to see expected misallocation cost rise.
Misallocation cost scales quadratically with posterior σ when the response curve is smoothly curved near the optimum. Halving σ doesn't halve the cost; it cuts it to roughly a quarter. This is why moving even one channel from "model-only" to "experiment-backed" pays back so quickly.
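A rough Monte Carlo check of the quadratic scaling, assuming a square-root response $f(b) = \sqrt{b}$, a true ROI of 1 for both channels, and independent Gaussian estimation noise. These modelling choices and all function names are assumptions for illustration, not the calculator's actual model.

```python
import math
import random

def optimal_share(theta1: float, theta2: float) -> float:
    """Channel-1 budget share maximising theta1*sqrt(b1) + theta2*sqrt(b2).

    Closed form for a square-root response; assumes both estimates are positive.
    """
    return theta1 ** 2 / (theta1 ** 2 + theta2 ** 2)

def relative_loss(share: float) -> float:
    """Fraction of optimal value lost when both channels truly have equal ROI."""
    best = 2.0 * math.sqrt(0.5)  # value of the 50/50 split, budget normalised to 1
    actual = math.sqrt(share) + math.sqrt(1.0 - share)
    return (best - actual) / best

def expected_loss(sigma: float, n_draws: int = 20000, seed: int = 0) -> float:
    """Average relative loss when each ROI estimate is 1 + N(0, sigma^2)."""
    rng = random.Random(seed)  # same seed gives common random numbers across sigmas
    total = 0.0
    for _ in range(n_draws):
        t1 = 1.0 + sigma * rng.gauss(0.0, 1.0)
        t2 = 1.0 + sigma * rng.gauss(0.0, 1.0)
        total += relative_loss(optimal_share(t1, t2))
    return total / n_draws

loss_wide = expected_loss(0.10)
loss_narrow = expected_loss(0.05)
print(f"sigma=0.10: {loss_wide:.4%} of value lost")
print(f"sigma=0.05: {loss_narrow:.4%} of value lost")
print(f"ratio: {loss_wide / loss_narrow:.2f}")
```

Under these assumptions the ratio of the two losses should land close to four, which is the quadratic scaling described above.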
No magic. The MMM is the prior, the experiment is the likelihood, the next MMM uses the posterior.
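Under the standard Gaussian-Gaussian setup, the posterior after one experiment is closed-form: precisions add, means are precision-weighted. With prior $\mathcal{N}(\mu_\theta, \sigma_\theta^2)$ and an experimental estimate $\hat\tau$ with standard error $\sigma_e$:

$$\frac{1}{\sigma_{\text{post}}^2} = \frac{1}{\sigma_\theta^2} + \frac{1}{\sigma_e^2}, \qquad \mu_{\text{post}} = \sigma_{\text{post}}^2 \left( \frac{\mu_\theta}{\sigma_\theta^2} + \frac{\hat\tau}{\sigma_e^2} \right)$$

This is the workhorse formula behind every "soft calibration" mechanism in the framework.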
Set the MMM prior (sage) and a candidate experimental result (gold). The posterior (forest) is computed live. Notice: the posterior always sits between — but it leans hard toward whichever distribution is narrower. A precise experiment overwhelms a vague prior; a confident prior pushes back against a noisy experiment.
M1 (informed prior) applies this update between MMM fits. M2 (likelihood augmentation) folds the experimental term into one joint posterior. M3 (hierarchical pooling) lets a tested channel pull untested siblings via a shared category-level mean. All three reduce, in the Gaussian limit, to versions of the formula above.
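The conjugate update is a few lines of arithmetic. A minimal sketch, assuming a hypothetical MMM prior of ROI 1.8 ± 0.6 and an experiment reading of 1.2 ± 0.3; the numbers and the function name are illustrative.

```python
def gaussian_update(prior_mean, prior_sd, obs_mean, obs_sd):
    """Posterior mean and sd for a Gaussian prior and a Gaussian observation."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    obs_prec = 1.0 / obs_sd ** 2
    post_prec = prior_prec + obs_prec  # precisions add
    post_mean = (prior_prec * prior_mean + obs_prec * obs_mean) / post_prec
    return post_mean, post_prec ** -0.5

# Hypothetical inputs: the experiment is twice as precise (in sd) as the prior.
post_mean, post_sd = gaussian_update(1.8, 0.6, 1.2, 0.3)
print(f"posterior mean={post_mean:.3f}, posterior sd={post_sd:.3f}")
```

Because the experiment carries four times the prior's precision here, the posterior mean sits 80% of the way toward the experimental result.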
A scalar, in bits, that ranks experiments by how much they'd shrink our beliefs.
EIG is the expected reduction in posterior entropy from running an experiment. For Gaussian conjugate updates it has a clean closed form depending only on the prior-to-experiment variance ratio:

$$\mathrm{EIG} = \tfrac{1}{2} \log_2\!\left(1 + \frac{\sigma_k^2}{\sigma_{\text{exp}}^2}\right)$$
The heatmap shows EIG over the (prior σ, experiment σ) plane. Channels in the upper-left — loose prior, precise experiment — are where information gain is maximised. Drag the markers below to position your channel and read off its EIG.
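EIG is concave in the variance ratio $r = \sigma_k^2/\sigma_{\text{exp}}^2$. Under the Gaussian form $\tfrac{1}{2}\log_2(1 + r)$, each quadrupling of the ratio buys at most one bit: going from 1× to 4× adds about 0.66 bits, and going from 4× to 16× adds about 0.88, even though the second step demands four times the precision of the first. Diminishing returns are baked in. The right experiment for a fuzzy channel is one where you can hit a moderate variance ratio, not the most precise possible: you're paying linear cost for logarithmic gain.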
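To see the logarithmic growth numerically, the sketch below evaluates the standard Gaussian information-gain formula $\tfrac{1}{2}\log_2(1 + \sigma_k^2/\sigma_{\text{exp}}^2)$ at successive 4× steps in the variance ratio. The function name is illustrative.

```python
import math

def eig_bits(prior_sd: float, exp_sd: float) -> float:
    """Expected information gain in bits for a Gaussian conjugate update."""
    variance_ratio = (prior_sd / exp_sd) ** 2
    return 0.5 * math.log2(1.0 + variance_ratio)

previous = None
for variance_ratio in (1, 4, 16, 64):
    # Hold the experiment sd at 1 and set the prior sd to hit the target ratio.
    bits = eig_bits(math.sqrt(variance_ratio), 1.0)
    step = "" if previous is None else f" (+{bits - previous:.2f} vs previous)"
    print(f"variance ratio {variance_ratio:>2}x: {bits:.2f} bits{step}")
    previous = bits
```

Each row costs four times the precision of the row before it, yet never adds more than one bit.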
Bits don't pay the rent. Translate uncertainty reduction into dollars before prioritising.
EVOI prices the information from an experiment by the expected dollar improvement in the budget decision that follows. The linearised approximation makes the structure transparent:
Four levers: budget at stake, share going to this channel, magnitude of the expected belief revision, and the probability that the revision crosses a budget-decision threshold.
Each slider moves one of the four levers. Watch how a high-EIG experiment on a tiny channel can have low EVOI (huge variance reduction, tiny stakes), while a moderate-EIG experiment on a 30%-share channel near the decision threshold can have $500K+ value per cycle.
EIG is necessary but not sufficient. A channel with huge prior σ but 1% spend share has enormous information gain and trivial decision value. The next section combines both into the priority score that picks each cycle's experimental portfolio.
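One way to make the four levers concrete is to multiply them. The product form below is an assumption (the text names the levers, not how they combine), and the two channels are hypothetical.

```python
def evoi_linearised(budget, share, expected_revision, p_cross):
    """Assumed product form of the linearised EVOI, in dollars.

    budget            total budget at stake
    share             fraction of budget in this channel
    expected_revision expected absolute ROI revision (return per dollar)
    p_cross           probability the revision crosses a decision threshold
    """
    return budget * share * expected_revision * p_cross

# Hypothetical channels: huge uncertainty on a tiny channel versus
# moderate uncertainty on a large channel near its decision threshold.
tiny_fuzzy = evoi_linearised(10_000_000, 0.01, 0.50, 0.6)
large_near_threshold = evoi_linearised(10_000_000, 0.30, 0.25, 0.7)
print(f"tiny, fuzzy channel:         ${tiny_fuzzy:,.0f}")
print(f"large channel near threshold: ${large_near_threshold:,.0f}")
```

Even with twice the expected belief revision, the small channel's value is an order of magnitude lower because so little budget rides on the answer.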
Tune the weight between learning and decision value. Watch the recommended portfolio rearrange itself.
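The composite priority score combines normalised EIG and EVOI with a tradeoff parameter $\lambda \in [0, 1]$:

$$P_k = \lambda\, \widetilde{\mathrm{EIG}}_k + (1 - \lambda)\, \widetilde{\mathrm{EVOI}}_k$$

where the tilde marks a quantity normalised across channels.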
$\lambda = 1$ is pure information maximisation (pick the channels we know least about, regardless of stakes). $\lambda = 0$ is pure decision-quality (only test where the answer would change the budget). Most healthy programs sit around $\lambda = 0.3$–$0.5$.
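A small sketch of how the ranking shifts with $\lambda$. Min-max normalisation is an assumption here (the text does not say which normalisation is used), and the four channels with their EIG and EVOI values are hypothetical.

```python
def normalise(values):
    """Min-max scale to [0, 1]; returns zeros if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def priority_scores(channels, lam):
    """Score = lam * normalised EIG + (1 - lam) * normalised EVOI."""
    eig_n = normalise([c["eig"] for c in channels])
    evoi_n = normalise([c["evoi"] for c in channels])
    return {c["name"]: lam * e + (1.0 - lam) * v
            for c, e, v in zip(channels, eig_n, evoi_n)}

def rank(channels, lam):
    """Channel names ordered from highest to lowest priority score."""
    scores = priority_scores(channels, lam)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical portfolio: EIG in bits, EVOI in $K.
channels = [
    {"name": "search", "eig": 0.4, "evoi": 520.0},
    {"name": "social", "eig": 1.1, "evoi": 310.0},
    {"name": "ctv", "eig": 1.8, "evoi": 140.0},
    {"name": "affiliates", "eig": 2.3, "evoi": 25.0},
]
for lam in (0.0, 0.4, 1.0):
    print(f"lambda={lam}: {rank(channels, lam)}")
```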
A realistic six-channel portfolio with the EIG/EVOI characteristics noted in the table. Drag $\lambda$ and the operational cap. The 2×2 priority matrix updates live; the table re-ranks; selected experiments show a green badge.
| Channel | Spend | $\sigma_k$ | $\sigma_{\text{exp}}$ | EIG (bits) | EVOI (\$K) | Score | Pick |
|---|---|---|---|---|---|---|---|
Operational caps, geo conflicts, brand-safety risk. Picking a portfolio is a constrained optimisation, not a beauty contest.
Real selection adds two penalty terms to the priority score: an experimentation cost ($\gamma$) and an operational risk ($\rho$, a penalty weight unrelated to the CRRA risk-aversion parameter $\rho$ used earlier). Cost discourages large geo footprints; risk discourages tests that could create brand exposure or geo overlap with running campaigns.
EIG of a set of experiments is monotone submodular, so greedy selection under a cap on the number of tests is guaranteed to capture at least $1 - 1/e \approx 63\%$ of the best achievable portfolio EIG at that cap. Additional experiments past the third or fourth typically yield trivial marginal value. The bar chart below visualises that decay.
Same six channels as § 6, now with cost and risk penalties applied. The marginal-EIG bars show what each additional selected experiment contributes; the cumulative line is the running portfolio EIG. Past the third experiment, marginal returns drop sharply.
A test that overlaps with a brand-sensitive launch, or that competes for geos already running a separate experiment, can have ROI-positive math but enterprise-negative consequences. The risk penalty $\rho$ exists to make this explicit. Don't run experiments that win on a spreadsheet but lose in the executive review.
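Greedy selection with penalties can be sketched as follows. Portfolio EIG is stood in for by a weighted-coverage function (each experiment informs a set of channels, and a channel counts once), which is monotone submodular but is an assumption, not the framework's EIG computation. All names, weights, costs, and risks are hypothetical. The $1 - 1/e$ guarantee applies to the unpenalised, cardinality-capped problem, so treat the penalised version as a heuristic.

```python
def coverage_value(selected, informs, weights):
    """Weighted-coverage stand-in for portfolio EIG."""
    covered = set()
    for exp in selected:
        covered |= informs[exp]
    return sum(weights[ch] for ch in covered)

def greedy_portfolio(informs, weights, cost, risk, gamma, rho_risk, cap):
    """Add the experiment with the best penalised marginal gain until the cap."""
    selected, gains = [], []
    while len(selected) < cap:
        base = coverage_value(selected, informs, weights)
        best, best_net, best_gain = None, 0.0, 0.0
        for exp in informs:
            if exp in selected:
                continue
            gain = coverage_value(selected + [exp], informs, weights) - base
            net = gain - gamma * cost[exp] - rho_risk * risk[exp]
            if net > best_net:
                best, best_net, best_gain = exp, net, gain
        if best is None:  # nothing left with positive net value
            break
        selected.append(best)
        gains.append(best_gain)
    return selected, gains

# Hypothetical inputs: which channels each test informs, bits per channel.
informs = {"search": {"search", "shopping"}, "social": {"social", "video"},
           "ctv": {"ctv", "video"}, "display": {"display"}}
weights = {"search": 1.2, "shopping": 0.4, "social": 0.9,
           "video": 0.5, "ctv": 1.0, "display": 0.3}
cost = {"search": 0.2, "social": 0.3, "ctv": 0.6, "display": 0.1}
risk = {"search": 0.1, "social": 0.2, "ctv": 0.5, "display": 0.0}

selected, gains = greedy_portfolio(informs, weights, cost, risk,
                                   gamma=0.5, rho_risk=0.5, cap=3)
print(selected, [round(g, 2) for g in gains])
```

The CTV test's marginal gain shrinks once the social test has already covered video, which is the decay the marginal-EIG bars illustrate.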
A single experiment is a one-shot gain. The loop is a productivity curve. Run it.
Each quarterly cycle re-fits the MMM, recomputes priorities, runs 2–3 experiments, calibrates, re-allocates, and re-scores. Calibrated channels see their EIG and EVOI drop, so previously lower-priority channels rise. The simulator below traces the four headline metrics across N cycles.
Five channels, configurable starting MMM uncertainty, calibration mechanism, and cycle count. The four panels show: posterior CI widths contracting on tested channels, allocation share migrating to true high-ROI channels, weekly misallocation cost falling, and portfolio marginal ROI rising. Dashed traces show the no-experiments counterfactual.
Top-left = epistemic outcome (what we know better). Top-right = operational outcome (what we do differently as a result). Bottom-left = economic outcome (what poor knowledge was costing us). Bottom-right = strategic outcome (the bottom-line lift in marginal productivity). A defensible measurement program reports all four to the sponsor every cycle.
The right number of experiments per cycle is whatever sits on the elbow of the cost-value curve.
Total experiment cost rises linearly with the number of tests in a cycle; captured EVOI rises concavely thanks to submodularity. The net value $V(n) = \mathrm{EVOI}(n) - c \cdot n$ has a maximum, almost always between two and four tests for a five-channel portfolio.
Drag the per-test cost and the channel count. The curve shows captured EVOI minus total cost as a function of how many experiments we run that cycle. The marker tracks the optimum.
The number of useful experiments per cycle is bounded above by submodularity, not by management appetite. Even with infinite budget and zero ops constraints, the marginal information from the fourth or fifth experiment is usually one-tenth of the first. Plan around the elbow; do not let "we should test more" smuggle in tests with negative net value.
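Finding the elbow is a short scan over $V(n)$. The sketch assumes the marginal EVOI of each successive test is already known and decaying; the dollar figures are hypothetical.

```python
def best_test_count(marginal_evoi, cost_per_test):
    """Return (n, V(n)) maximising V(n) = captured EVOI - cost_per_test * n."""
    ordered = sorted(marginal_evoi, reverse=True)  # run the most valuable tests first
    best_n, best_v, captured = 0, 0.0, 0.0
    for n, gain in enumerate(ordered, start=1):
        captured += gain
        net = captured - cost_per_test * n
        if net > best_v:
            best_n, best_v = n, net
    return best_n, best_v

# Hypothetical marginal EVOI per additional test, in $K; the fifth is a tenth of the first.
marginal_evoi = [400.0, 220.0, 120.0, 60.0, 40.0]
n_star, v_star = best_test_count(marginal_evoi, cost_per_test=100.0)
print(f"optimal tests per cycle: {n_star}, net value: ${v_star:.0f}K")
```

The optimum sits at the last test whose marginal EVOI still exceeds the per-test cost.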
A geo-level MMM is not just a measurement artifact — it is an experiment design engine. The posterior tells you which geos to pick, how much budget to shift, and how long to run.
Once an MMM is fitted with geo-level variation, every posterior sample doubles as a forward simulation: what would the KPI do in geo $g$ if spend rose by $\Delta b$? The standard measurement playbook defaults to picking "similar-looking" markets by eye, applying a round percentage lift, and running for six to eight weeks. Each of those defaults (market selection, treatment intensity, duration) is a design choice that meaningfully affects information yield, and a geo-level Bayesian MMM gives you better answers for all of them.
A hierarchical geo-level MMM produces a richer set of outputs than the channel-level marginals used for channel selection. The quantities relevant to test design are:
| Quantity | Symbol | Use in test design |
|---|---|---|
| Per-geo channel posterior | $\beta_{k,g} \sim \mathcal{N}(\mu_{k,g},\,\sigma_{k,g}^2)$ | Identifies geos with most uncertainty (treated-geo ranking) |
| Per-geo current spend | $x_{k,g}$ | Anchors the saturation-curve evaluation point |
| Per-geo saturation curve | $S_k(\cdot;\,h_{k,g},\kappa_{k,g})$ | Local slope determines optimal $\Delta_{\text{spend}}$ |
| Per-geo residual variance | $\sigma_g^2$ | Plugs directly into the $\sigma_{\text{exp}}$ formula |
| Within-geo serial correlation | $\rho_g$ | Determines how duration converts to precision |
| Geo random-effect posteriors | $\mathbf{u}_g$ | Mahalanobis distance for matched-pair construction |
| Cross-geo posterior correlations | $\mathrm{Cor}(\beta_{k,g},\beta_{k,g'})$ | Spillover and contamination diagnostics |
The ideal (treatment, control) geo pair satisfies three criteria: high pre-period correlation on the KPI (well-specified counterfactual), comparable absolute scale (neither side dominates), and low cross-geo spillover (independent markets). A simple balance score captures the first two:
But correlation on raw KPI is a noisy proxy. The richer signal from the geo-level MMM enables a proper information yield score that combines uncertainty, saturation slope, and residual noise into a single ranking criterion:
The numerator captures the prior variance scaled by the responsiveness of outcome to spend at the current operating point; the denominator penalizes geos with noisy outcomes. Apply hard eligibility filters first (minimum population, no recent anomalies), not as tiebreakers after ranking. Then rank the remaining candidate geos by $I_{k,g}$ and take the top $G_T$ (typically 8–24 markets).
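The filter-then-rank procedure can be sketched as below. The score form $(\sigma_{k,g}\, S_k')^2 / \sigma_g^2$ is an assumption consistent with the description (prior variance scaled by slope, divided by residual variance), not a confirmed formula, and the geo records are hypothetical.

```python
def information_yield(prior_sd, slope, resid_sd):
    """Assumed score: (prior sd x saturation slope)^2 / residual variance."""
    return (prior_sd * slope) ** 2 / resid_sd ** 2

def select_treatment_geos(geos, top_n, min_population):
    """Apply hard eligibility filters first, then rank by information yield."""
    eligible = [g for g in geos
                if g["population"] >= min_population and not g["recent_anomaly"]]
    ranked = sorted(
        eligible,
        key=lambda g: information_yield(g["prior_sd"], g["slope"], g["resid_sd"]),
        reverse=True,
    )
    return [g["name"] for g in ranked[:top_n]]

# Hypothetical geos; geo_c has the highest raw score but a recent anomaly.
geos = [
    {"name": "geo_a", "prior_sd": 0.5, "slope": 0.8, "resid_sd": 1.0,
     "population": 900_000, "recent_anomaly": False},
    {"name": "geo_b", "prior_sd": 0.6, "slope": 0.5, "resid_sd": 1.2,
     "population": 1_400_000, "recent_anomaly": False},
    {"name": "geo_c", "prior_sd": 0.9, "slope": 0.9, "resid_sd": 1.0,
     "population": 2_000_000, "recent_anomaly": True},
    {"name": "geo_d", "prior_sd": 0.4, "slope": 0.3, "resid_sd": 0.8,
     "population": 700_000, "recent_anomaly": False},
]
print(select_treatment_geos(geos, top_n=2, min_population=500_000))
```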
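For each treated geo, the control pool must be genuinely similar in the dimensions that matter for channel $k$'s effect. A hierarchical MMM gives you a far better metric than correlation matching: each geo has a posterior over its random-effect vector $\mathbf{u}_g$ (channel-coefficient deviations, baseline level, seasonality). Genuinely matched markets have similar $\mathbf{u}_g$ posteriors. The natural distance is Mahalanobis with the joint posterior covariance:

$$d^2(g, g') = (\bar{\mathbf{u}}_g - \bar{\mathbf{u}}_{g'})^\top\, \boldsymbol{\Sigma}_u^{-1}\, (\bar{\mathbf{u}}_g - \bar{\mathbf{u}}_{g'})$$

where $\bar{\mathbf{u}}_g$ is the posterior mean of $\mathbf{u}_g$ and $\boldsymbol{\Sigma}_u$ is the joint posterior covariance.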
Markets close in this metric will respond similarly to channel-$k$ perturbations even if their raw revenue trajectories differ in level or seasonality. In typical retail and DTC datasets, Mahalanobis matching on the MMM posterior reduces post-period control-group variance by 20–40% versus correlation matching on the same pre-period.
Geos sharing a regional hyperparameter cluster together in posterior space. The posterior covariance over geo-level random effects is therefore already a similarity matrix, with no separate matching model required. For each treated geo, take the $M$ closest control geos ($M \in \{1,2,3\}$ is typical) and use them as the donor pool.
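Donor-pool construction reduces to a distance computation and a sort. The sketch assumes each geo is summarised by the posterior mean of a two-dimensional random-effect vector and that one shared covariance matrix is used for every pair; both are simplifications, and the geo names and values are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis_sq(u_a, u_b, cov):
    """Squared Mahalanobis distance (u_a - u_b)^T cov^{-1} (u_a - u_b)."""
    diff = [p - q for p, q in zip(u_a, u_b)]
    weighted = solve(cov, diff)
    return sum(d * w for d, w in zip(diff, weighted))

def nearest_controls(treated, candidates, means, cov, k=2):
    """The k candidate geos closest to the treated geo in posterior space."""
    return sorted(candidates,
                  key=lambda g: mahalanobis_sq(means[treated], means[g], cov))[:k]

# Hypothetical posterior means: (channel-coefficient deviation, baseline level).
means = {"geo_t": [0.30, 1.00], "geo_a": [0.35, 1.10],
         "geo_b": [-0.40, 0.20], "geo_c": [0.10, 0.90]}
cov = [[0.04, 0.01], [0.01, 0.09]]
print(nearest_controls("geo_t", ["geo_a", "geo_b", "geo_c"], means, cov))
```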
The signal of the experiment is the KPI lift attributable to the budget increment in treatment geos. Standard practice sets the uplift as a round percentage (e.g., "add 30%") across all geos. But the marginal return to spend is different at each geo's current operating point on the saturation curve, so the same percentage uplift produces very different signals.
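The MMM gives you the local slope of the saturation curve at each geo's observed spend:

$$S_k'(x_{k,g}) = \left.\frac{\partial S_k(x;\,h_{k,g},\kappa_{k,g})}{\partial x}\right|_{x = x_{k,g}}$$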
A geo already near saturation has a small slope — a large spend increment is needed to generate a detectable lift. A geo well below saturation has a steep slope — a modest increment suffices. Equating the target signal-to-noise ratio across geos gives the saturation-slope-aware optimal spend increment:
where $\sigma_g$ is the geo's residual standard deviation, $\rho_g$ its within-geo serial correlation, $G$ the number of treatment geos, $T$ the planned duration, and $\eta$ the target power. In practice $\Delta\mathrm{spend}_g^*$ varies 3–5× across geos for the same channel — treating every geo identically leaves information on the table.
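To illustrate why the increment varies by geo, the sketch below assumes a Hill saturation curve $S(x) = x^h/(x^h + \kappa^h)$ and a first-order approximation in which the KPI lift is $\beta \cdot S'(x) \cdot \Delta\mathrm{spend}$. The Hill form, the parameter `beta`, and the fixed target lift are assumptions; the full formula with $\sigma_g$, $\rho_g$, $G$, $T$ and $\eta$ is not reproduced here.

```python
def hill(x, h, kappa):
    """Hill saturation curve, rising from 0 toward 1."""
    return x ** h / (x ** h + kappa ** h)

def hill_slope(x, h, kappa):
    """Analytic derivative of the Hill curve with respect to spend."""
    return h * kappa ** h * x ** (h - 1) / (x ** h + kappa ** h) ** 2

def spend_increment(target_lift, beta, x, h, kappa):
    """First-order spend change needed to produce target_lift in the KPI."""
    return target_lift / (beta * hill_slope(x, h, kappa))

# Two hypothetical geos on the same curve (h=1, kappa=100), different spend levels.
beta, target_lift = 1000.0, 5.0
responsive = spend_increment(target_lift, beta, x=100.0, h=1.0, kappa=100.0)
saturated = spend_increment(target_lift, beta, x=300.0, h=1.0, kappa=100.0)
print(f"responsive geo needs +{responsive:.1f} spend units")
print(f"saturated geo needs  +{saturated:.1f} spend units")
```

Because the curve is concave, the linear approximation understates the spend needed for large increments, so treat the result as a starting point rather than a final design value.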
The effective signal the experiment must detect is:
A wide ROI posterior $\sigma_\theta$ sets a hard ceiling on achievable power regardless of budget — via the delta method, $\mathrm{Var}(\Delta y) \geq \Delta b^2\,\sigma_\theta^2$. Check that ceiling before committing to the design (see the interactive demo below).
Under a difference-in-differences estimator with $n_t$ treatment geos and $n_c$ control geos observed for $T$ weeks, the naive standard error shrinks as $1/\sqrt{T}$. Weekly KPI data within a geo are serially correlated, however: successive weeks share common demand shocks, promotions, and seasonal patterns. The MMM residuals estimate the within-geo AR(1) coefficient $\rho_g$, and the corrected standard error is:
When $\rho_g = 0$ this reduces to the familiar $1/\sqrt{T}$ formula. When $\rho_g > 0$ the denominator grows slower than $\sqrt{T}$, so each additional week adds less precision. Empirically, Expected Information Gain (EIG) flattens after 6–10 weeks for geos with $\rho_g \in [0.2, 0.5]$ — common values in weekly retail or digital data. Running longer costs budget without proportionally improving power.
Statistical power at duration $T$ is:
Rearranging gives the minimum required duration for target power $1-\beta$:
Because $T^*$ appears on both sides, solve iteratively (or use the calculator below). The dashed curve in the chart adds the MMM's own posterior spread on $\theta$: even with perfect execution, the contribution $\Delta b^2\,\sigma_\theta^2$ to $\mathrm{Var}(\Delta y)$ is independent of test duration and sets a hard power ceiling. Check that ceiling before committing.
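A minimal sketch of the duration calculation, assuming the common large-$T$ AR(1) correction in which the variance of a time average is inflated by $(1+\rho_g)/(1-\rho_g)$, and a two-sided z-test. This may differ in detail from the formula the calculator uses; the function names and inputs are illustrative.

```python
import math
from statistics import NormalDist

def did_standard_error(sigma_g, rho_g, n_t, n_c, weeks):
    """SE of the treatment-control difference with AR(1) variance inflation."""
    inflation = (1.0 + rho_g) / (1.0 - rho_g)  # equals 1 when rho_g = 0
    return sigma_g * math.sqrt((1.0 / n_t + 1.0 / n_c) * inflation / weeks)

def power(lift, sigma_g, rho_g, n_t, n_c, weeks, alpha=0.05):
    """Approximate power of a two-sided z-test (far tail ignored)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = did_standard_error(sigma_g, rho_g, n_t, n_c, weeks)
    return NormalDist().cdf(abs(lift) / se - z_crit)

def min_weeks(lift, sigma_g, rho_g, n_t, n_c, target=0.8, max_weeks=52):
    """Smallest duration reaching the target power, or None if never reached."""
    for weeks in range(1, max_weeks + 1):
        if power(lift, sigma_g, rho_g, n_t, n_c, weeks) >= target:
            return weeks
    return None

# Hypothetical design: lift of 0.8 residual sds, 4 treatment and 4 control geos.
for rho_g in (0.0, 0.4):
    weeks = min_weeks(0.8, 1.0, rho_g, 4, 4)
    print(f"rho_g={rho_g}: minimum duration {weeks} weeks")
```

Scanning integer durations sidesteps the implicit equation for $T^*$ and makes the cost of serial correlation visible: the same design needs roughly twice as long at $\rho_g = 0.4$.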
Drag the sliders to configure the experiment design. The solid curve shows power under the MMM's point-estimate ROI. The dashed curve folds in posterior ROI uncertainty — it plateaus once the MMM's own spread dominates the residual noise. The vertical marker shows where your target power is first reached.
A tight posterior on $\theta$ (small $\sigma_\theta$) means the model is already confident about channel effectiveness — the experiment only needs to confirm it, so a shorter or smaller design is still informative. A wide posterior signals that the experiment must be large enough to move the MMM needle, not just reach nominal significance. Check whether the power ceiling (plateau of the dashed curve) is above your target before committing to the design.
Before committing real budget, run the proposed design through the fitted MMM as a simulator. This catches power shortfalls and degenerate designs cheaply:
If simulated power falls short of target, increase $\Delta\mathrm{spend}_g^*$, add geos, or extend duration — then re-simulate. This loop typically takes minutes and surfaces issues (skewed KPI distributions, heterogeneous geo variance) that the closed-form formulas miss.
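The simulate-then-adjust loop can be prototyped without the fitted model. The sketch below is a stand-in, not the MMM simulator: it draws the true lift from an assumed Gaussian posterior, generates AR(1) noise per geo, compares post-period means only (pre-period differencing is omitted for brevity), and counts rejections. All parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist

def ar1_mean(rng, weeks, sigma, rho):
    """Time-average of a stationary AR(1) series with marginal sd sigma."""
    innov_sd = sigma * math.sqrt(1.0 - rho ** 2)
    x = rng.gauss(0.0, sigma)
    total = 0.0
    for _ in range(weeks):
        total += x
        x = rho * x + rng.gauss(0.0, innov_sd)
    return total / weeks

def simulated_power(lift_mean, lift_sd, sigma, rho, n_t, n_c, weeks,
                    n_sims=2000, alpha=0.05, seed=0):
    """Share of simulated tests whose z-statistic clears the critical value."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Analysis-stage SE using the large-T AR(1) inflation (an assumption).
    se = sigma * math.sqrt((1.0 / n_t + 1.0 / n_c) * (1.0 + rho) / (1.0 - rho) / weeks)
    hits = 0
    for _ in range(n_sims):
        lift = rng.gauss(lift_mean, lift_sd)  # one posterior draw of the true lift
        treated = sum(lift + ar1_mean(rng, weeks, sigma, rho) for _ in range(n_t)) / n_t
        control = sum(ar1_mean(rng, weeks, sigma, rho) for _ in range(n_c)) / n_c
        if abs(treated - control) / se > z_crit:
            hits += 1
    return hits / n_sims

# Hypothetical design: 4 vs 4 geos, 8 weeks, lift posterior 0.8 +/- 0.3.
print(f"simulated power: {simulated_power(0.8, 0.3, 1.0, 0.3, 4, 4, 8):.2f}")
```

Setting `lift_sd` to zero recovers power at the point estimate; a wide `lift_sd` shows how posterior spread in the lift drags simulated power down regardless of duration.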
Two geos whose channel posteriors move together are likely contaminated: a spend change in the treatment geo affects the control geo (e.g., national TV spillover, retargeting bleed across DMAs). The MMM captures this as cross-geo posterior correlation:

$$r_{g,g'} = \mathrm{Cor}(\beta_{k,g},\, \beta_{k,g'})$$
Flag any control candidate with $|r_{g,g'}| > 0.5$ and exclude it from the donor pool. If all nearby geos are contaminated, consider a holdout design where a cluster of adjacent geos are jointly treated and a geographically distant cluster serves as control. The adstock carryover introduces an analogous time-domain contamination: include a 1–2 week wash-in buffer before measuring lift, or model carryover explicitly in the geo-level analysis.
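Given posterior draws of the channel coefficient for each geo, the flagging rule is a correlation and a threshold. The sketch assumes the draws are aligned by sample index across geos; the tiny draw vectors are hypothetical and far shorter than a real posterior.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def contaminated_controls(draws, treated, candidates, threshold=0.5):
    """Candidates whose coefficient draws correlate with the treated geo above threshold."""
    return [g for g in candidates
            if abs(pearson(draws[treated], draws[g])) > threshold]

# Hypothetical posterior draws of beta_{k,g}, aligned by sample index.
draws = {
    "treated": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geo_1": [2.0, 4.0, 6.0, 8.0, 10.0],   # moves with the treated geo
    "geo_2": [1.0, -1.0, 0.0, 1.0, -1.0],  # weakly related
    "geo_3": [5.0, 4.0, 3.0, 2.0, 1.0],    # strongly negative, still flagged
}
print(contaminated_controls(draws, "treated", ["geo_1", "geo_2", "geo_3"]))
```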
The table below illustrates Steps 1–6 applied to a Connected-TV channel for a national retailer. Five geos were scored; the top two by information yield became treatment markets, with matched controls chosen by Mahalanobis distance.
| Geo | Role | $I_{k,g}$ rank | Saturation regime | $\Delta\mathrm{spend}_g^*$ | Required weeks $T^*$ |
|---|---|---|---|---|---|
| Atlanta | Treatment | 1st | Sub-saturated ($S'$ = 0.82) | +18% | 7 |
| Phoenix | Treatment | 2nd | Mid-saturated ($S'$ = 0.51) | +31% | 9 |
| Tampa | Control (Atlanta) | — | $d^2 = 0.9$ (closest match) | 0% | — |
| Portland | Control (Phoenix) | — | $d^2 = 1.4$ (closest match) | 0% | — |
| St. Louis | Excluded | 3rd | $r_{g,g'} = 0.71$ (spillover) | — | — |
Atlanta's lower $\Delta\mathrm{spend}^*$ (18% vs 31%) reflects its steeper saturation slope: the same ROI uncertainty is resolved more cheaply where the curve is still responsive. St. Louis was excluded despite its high information yield because its cross-geo posterior correlation with Atlanta ($r = 0.71$) exceeded the contamination threshold.
The same greedy submodular selection logic from §7 applies when you have multiple channels competing for experiment budget. Score every (channel, geo) pair by $I_{k,g}$, run the Mahalanobis matching and spillover exclusion for each candidate, then greedily select pairs up to the experiment budget. The greedy sequence achieves at least $(1-1/e) \approx 63\%$ of the optimal joint information gain, and the MMM gives you the full joint posterior needed to compute it.
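After the geo test concludes, the estimated lift $\hat\tau$ with standard error $\hat\sigma_e = \mathrm{SE}(\hat\tau)$ feeds directly into the Bayesian update from §3. With $\hat\tau$ expressed on the same ROI scale as $\theta$, the MMM's channel prior $\mathcal{N}(\mu_\theta, \sigma_\theta^2)$ updates by precision-weighted averaging:

$$\mu_{\text{post}} = \frac{\mu_\theta/\sigma_\theta^2 + \hat\tau/\hat\sigma_e^2}{1/\sigma_\theta^2 + 1/\hat\sigma_e^2}, \qquad \sigma_{\text{post}}^2 = \left(\frac{1}{\sigma_\theta^2} + \frac{1}{\hat\sigma_e^2}\right)^{-1}$$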
Longer tests and more geos shrink $\hat\sigma_e$, increasing EIG. But diminishing returns apply here too: once $\hat\sigma_e \ll \sigma_\theta$, additional weeks add near-zero information. The design sweet spot is where the experiment's precision is comparable to (not far tighter than) the MMM's prior uncertainty.
Adstock carryover means the first few weeks of a geo test carry contamination from pre-test spend levels — include a 1–2 week wash-in buffer or model carryover explicitly. Geo spillover (retargeting bleed, national TV halo) deflates estimated lift when treatment and control geos are adjacent; flag any control candidate with $|r_{g,g'}| = |\mathrm{Cor}(\beta_{k,g}, \beta_{k,g'})| > 0.5$ and exclude it from the donor pool. The MMM's spatial posterior correlations give you this diagnostic for free — no additional data collection required.
EIG, EVOI, and the priority score replace gut-feel "what should we test next?" with a score you can defend in a deck. The Bayesian update gives you a principled way to fold results back in — no analyst-led adjustments, no opaque overrides.
Each cycle produces three artifacts: the priority map (what's being tested and why), the calibrated allocation (with confidence tiers per channel), and the trajectory chart (compounding evidence of program value). Three pages. One meeting. Every quarter.
Misallocation cost is reported in dollars per week. Portfolio mROI is reported as a running multiple. Calibration coverage is reported as a fraction of spend. The program's payback is on the deck, not buried in a model spec.
Submodularity caps useful experimentation at 2–4 tests per cycle. Information decay triggers re-experimentation on a cadence the data sets, not the calendar. The loop is self-throttling and self-correcting.
For the conceptual framework and the math derivations, see the Closed-Loop Measurement & Calibration guide. For practical guidance on interpreting calibrated outputs in budget meetings, see Interpreting Results. For the full glossary of terms, see the Glossary.