For Business Stakeholders

What Is Specification Shopping?

Specification shopping is the practice of running many model variations and reporting only the ones that produce the results stakeholders expect to see. It is one of the most serious threats to the integrity of marketing measurement.

⚠️ How It Happens

Analyst runs 30 model variations
Only 3 show TV ROI above 1.0
Those 3 are presented as “the model”
Results look precise but are selection artifacts
Budget decisions made on false confidence

✓ What Should Happen

Model specification defined before seeing results
Uncertainty honestly reported as probability ranges
All reasonable models considered, not just “good” ones
Sensitivity to modeling choices tested and reported
Budget decisions account for genuine uncertainty

Why This Matters

When you test many specifications and select based on results, standard statistical inference is invalidated. The confidence intervals no longer mean what they claim to mean. The process systematically selects for confirming rather than disconfirming evidence.
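
A small simulation makes the selection effect concrete. The sketch below is illustrative only: it assumes a channel whose true ROI is 0.8 and treats each of 30 specifications as an independent noisy estimate of that value (a simplification, since real specifications are correlated). The names and numbers are assumptions chosen for the example, not output from the framework.

```python
import random
import statistics

TRUE_ROI = 0.8      # assumed ground truth for this illustration
NOISE_SD = 0.3      # assumed spread of estimates across specifications
N_SPECS = 30        # specifications tried in each analysis
N_ANALYSES = 2000   # repeated analyses, to see the average behaviour


def run_analysis(rng):
    """Return (pre-specified, shopped) ROI estimates for one analysis."""
    estimates = [rng.gauss(TRUE_ROI, NOISE_SD) for _ in range(N_SPECS)]
    pre_specified = estimates[0]   # the one specification fixed in advance
    shopped = max(estimates)       # the most flattering one, chosen afterwards
    return pre_specified, shopped


rng = random.Random(42)
results = [run_analysis(rng) for _ in range(N_ANALYSES)]
honest_mean = statistics.mean(h for h, _ in results)
shopped_mean = statistics.mean(s for _, s in results)

print(f"true ROI:               {TRUE_ROI:.2f}")
print(f"pre-specified, average: {honest_mean:.2f}")
print(f"shopped, average:       {shopped_mean:.2f}")
```

The pre-specified estimate is noisy but centred on the truth. The shopped estimate sits well above it on average, even though no individual model was fitted incorrectly. The bias comes entirely from the selection step.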

Why Causal, in Plain Terms

A marketing mix model is supposed to answer a causal question—"what would sales have been if we hadn't run this media?"—not just report what moved together. The distinction matters because lots of things move together without one causing the other. Ice cream sales and sunburns rise in the same weeks, but neither causes the other; summer does. In the same way, holiday demand lifts both your ad spend and your sales at the same time—so a careless model hands media the credit for sales the season would have delivered anyway.

The number that should drive budget decisions is incremental impact: the sales your media actually caused, above what would have happened without it. Spend that merely coincides with sales is spend you could cut without losing anything; spend that causes sales is spend worth defending and scaling. A model that cannot tell these apart will confidently send budget to the wrong channels.

How This Framework Separates Coincidence from Contribution

The framework accounts for confounders such as underlying demand, locks the model design in before results are seen, and checks its answers against real-world experiments such as regional holdout tests. The result is an estimate of what your media actually caused, stated with honest uncertainty.

See Confounding in Action

A year of weekly data. Spend ramps into the demand season—exactly when sales were going to rise anyway. Switch models and watch who gets the credit.

Media ROI the model reports

2.4×

What it means

The naive model hands media the season’s sales—the rooster taking credit for the sunrise. Budgets set on this number buy more roosters, not more sunrises.

Illustrative data. The same pattern is measured on synthetic worlds with known truth in pressure testing.
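
For readers who want to reproduce the pattern themselves, here is a minimal sketch of the same idea. It generates one synthetic year of weekly data in which seasonal demand drives both spend and sales, then compares a naive estimate with one that controls for demand. Every number here (the true ROI of 1.0, the demand curve, the noise levels) is an assumption made for illustration, and the two-step residual regression is a textbook stand-in, not the framework's actual model.

```python
import math
import random


def slope(x, y):
    """Least-squares slope of y on x (intercept included)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """The part of y that a straight-line fit on x does not explain."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]


TRUE_ROI = 1.0  # assumed: each unit of spend causes one unit of sales
rng = random.Random(7)

# Seasonal demand lifts both spend and sales in the same weeks.
demand = [100 + 40 * math.sin(2 * math.pi * w / 52) for w in range(52)]
spend = [0.5 * d + rng.gauss(0, 5) for d in demand]
sales = [0.8 * d + TRUE_ROI * s + rng.gauss(0, 5) for d, s in zip(demand, spend)]

# Naive: regress sales on spend alone, so media absorbs the season's credit.
naive_roi = slope(spend, sales)

# Adjusted: remove what demand explains from both series, then regress.
adjusted_roi = slope(residuals(demand, spend), residuals(demand, sales))

print(f"true ROI:     {TRUE_ROI:.2f}")
print(f"naive ROI:    {naive_roi:.2f}")
print(f"adjusted ROI: {adjusted_roi:.2f}")
```

The naive number is inflated because spend and sales both follow the season. Once demand is controlled for, the estimate lands near the true value. In real data the confounder is rarely observed this cleanly, which is why experimental calibration matters.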

Who Is Affected

Specification shopping is pervasive across the marketing measurement ecosystem. Each role faces distinct pressures that encourage the practice and distinct consequences when it fails.

🏢

Ad Agencies

Proving campaign value

Under pressure to demonstrate ROI for the media they placed. Models that show poor performance threaten client relationships and revenue.

📊

Model Shops

Delivering expected results

Hired to build models that “make sense.” When results contradict expectations, there is pressure to adjust until they align with priors.

🏭

Brands & Advertisers

Making budget decisions

Rely on model outputs to allocate millions in media spend. False precision from specification shopping leads to misallocated budgets.

For Ad Agencies

Agencies face a structural conflict of interest: they are often asked to measure the effectiveness of media they themselves placed. This creates subtle but powerful incentives to find positive results.

💰

The Incentive Problem

When your revenue depends on clients continuing to spend on channels you manage, models that show low ROI for those channels threaten your business. This creates unconscious bias in modeling decisions even among well-intentioned analysts.

🔍

How It Manifests

Adjusting adstock decay rates until a channel “looks right.” Removing control variables that reduce media coefficients. Zeroing out negative effects and calling them “non-significant.” Choosing between model variants based on which gives the “most reasonable” ROIs.

The Opportunity for Agencies

Agencies that adopt transparent, pre-specified modeling methodologies differentiate themselves from competitors. When your models are validated against holdout experiments, you can demonstrate genuine value rather than asserted value. This builds deeper, more durable client relationships.

For Model Shops

Dedicated analytics firms face the “client satisfaction versus scientific rigor” tension daily. The commercial pressure to deliver results that “make sense” is real, but it comes at a cost.

⚖️

The Credibility Trap

If clients hire you expecting certain results and you deliver models that confirm expectations, no one complains. But when two different model shops produce contradictory results for the same brand using the same data, the industry’s credibility erodes.

📈

The Validation Gap

Most model shops cannot point to systematic validation of their predictions. How often are model-implied ROIs tested against controlled experiments? Without this feedback loop, there is no mechanism to distinguish good models from specification-shopped ones.

When everyone uses the same biased methods, an entire industry can be confidently wrong. The first firms to break this cycle will have a significant competitive advantage.

— the framework’s design principle

Common Specification Shopping Practices

Practice	Why It’s Done	Why It’s Harmful
Zeroing out negative media effects	“Media can’t have negative ROI”	Systematically biases all estimates upward; makes uncertainty invisible
Tuning adstock until results look right	“Domain knowledge about decay rates”	When done after seeing results, it’s a form of p-hacking; invalidates inference
Dropping control variables that reduce media effects	“Multicollinearity issues”	Omitting confounders inflates causal estimates; leads to incorrect attribution
Selecting “best” model from many candidates	“Model selection is standard practice”	Winner’s curse: the selected model overestimates effect sizes; reported uncertainty is too narrow
Adjusting priors to match expected ROIs	“Incorporating domain knowledge”	When done iteratively after seeing posteriors, this is Bayesian specification shopping; the posterior no longer reflects an honest belief update
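
The first row of the table can be checked with a few lines of simulation. This sketch assumes a channel with a small true effect (0.1) measured with noise, then compares the average of the raw estimates with the average after negative values are set to zero. The effect size and noise level are illustrative assumptions.

```python
import random
import statistics

TRUE_EFFECT = 0.1   # assumed small positive effect
NOISE_SD = 0.5      # assumed estimation noise

rng = random.Random(0)
raw = [rng.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(10_000)]

# "Media can't have negative ROI": replace every negative estimate with zero.
clipped = [max(0.0, estimate) for estimate in raw]

raw_mean = statistics.mean(raw)
clipped_mean = statistics.mean(clipped)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"average raw estimate:   {raw_mean:.2f}")
print(f"average after clipping: {clipped_mean:.2f}")
```

Clipping removes only the estimates that erred low and keeps every estimate that erred high, so the average drifts upward and the reported spread shrinks at the same time.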

For Brands & Advertisers

As the end consumers of marketing measurement, brands are both the primary victims of specification shopping and the primary beneficiaries of improved methodology. Understanding what questions to ask is the first step toward better outcomes.

💸

The Budget Impact

If your model overstates TV ROI by 40% due to specification shopping, you are systematically over-investing in television at the expense of other channels. Over a year, this can mean millions of dollars in misallocated spend. The only spend that earns its budget line is incremental spend—media whose causal effect has been measured, and ideally calibrated against a real experiment.

📋

The Year-over-Year Problem

Have you noticed that model results change dramatically year over year even when strategy is stable? This instability is often a sign of specification shopping—different analysts making different ad hoc choices rather than a systematic change in market dynamics.

Financial Consequences

We will not quote industry-wide loss statistics here—we have no source we could defend, and inventing one is exactly what this page warns against. What we can show is what happens on synthetic markets where the true ROI is known in advance, from the framework’s own pressure-testing program:

8 of 16

Silent Failures

Synthetic stress worlds where attribution was materially wrong while every standard check looked healthy

+69%

Worst Channel Error

In the hidden-confounder world, one channel’s contribution was overstated by 69% (median error across channels: 23%)

≤1.02

R-hat on Every Silent Failure

Convergence diagnostics stayed green on every one of those wrong answers

Divergences

The sampler raised no alarm—a model can be confidently, quietly wrong

Measured on synthetic worlds with known ground truth — see Pressure Testing for the full scorecard.

The Compounding Effect

Specification shopping doesn’t just produce a single bad estimate. It produces a systematically biased view of your entire media portfolio. The channels that appear most effective are often the ones where the model had the most room to be optimistic—typically those with the least experimental validation.

Case Study: What the Gap Is Worth

In the Aurora Coffee Co. engagement—a synthetic brand whose true channel effects are on file—two planners split the same budget. The dashboard-style plan funded the channels that looked best; the experiment-anchored causal plan funded the channels that were best. Scored against the known ground truth, the causal plan was worth ≈ +$11.9M of revenue per year on the same budget ($11,884k/yr; a first-order estimate whose direction is verified against ground truth).

Measured on the Aurora synthetic engagement (known ground truth) — see the Aurora finale for the full reallocation.

Credibility Risk

The marketing measurement industry faces a growing credibility crisis. As data science matures in other domains and clients become more sophisticated, the gap between standard practice and scientific rigor becomes harder to ignore.

We won’t put percentages on these risks—no one has measured them, and a progress bar with an invented number is the same false precision this page argues against. Qualitatively, the exposures are real:

Client trust: every contradictory year-over-year result, and every pair of model shops delivering opposite answers from the same data, erodes the credibility of the whole category.
Scrutiny: as procurement and finance teams grow more data-literate, unvalidated causal claims invite harder questions—and harder audits.
Competitive displacement: firms that can point to experimental validation of their estimates have a straightforward way to take business from firms that cannot.

How to Detect Specification Shopping

Whether you are commissioning a model or reviewing one, these indicators suggest specification shopping may have occurred.

No negative media effects anywhere

If every channel shows positive ROI, ask whether positivity was imposed as a constraint or emerged from the data. In reality, some channels in some time periods may show negligible or negative incremental effects, especially when over-saturated.

Extremely narrow confidence intervals

If the model says TV ROI is 1.42 (1.38–1.46), ask how this precision was achieved. With typical MMM data, genuine uncertainty is much wider. Artificially narrow intervals are a hallmark of selecting among specifications.

Results perfectly match prior expectations

If every result aligns with what the client expected, ask what would have been reported if results contradicted expectations. A model that always confirms priors is a mirror, not a measurement tool.

Dramatic year-over-year changes with no clear driver

If last year’s model showed TV was strongest and this year shows digital is strongest—with no change in strategy—the modeling process itself is likely the source of variation.

No holdout validation or experimental calibration

If model predictions have never been tested against controlled experiments, there is no empirical basis for trusting the results. In-sample fit measures like R-squared do not validate causal claims.

A Better Approach

The MMM Framework is built from the ground up to eliminate specification shopping while producing genuinely useful business insights.

✕ Traditional Approach

Run many models, report “the best one”
Point estimates with false precision
Post hoc adjustments to “fix” results
No experimental validation
Different analysts, different results
Confidence comes from presentation, not evidence

✓ MMM Framework Approach

Pre-specify model before seeing results
Full posterior distributions with honest uncertainty
Bayesian priors encode domain knowledge transparently
Built-in experimental calibration support
Reproducible: same data, same results
Confidence comes from validated predictions

Rigor here is an ongoing practice, not a one-time deliverable: measurement runs as a loop—fit the model, find where uncertainty is most expensive, run the experiment that buys the most learning, feed the result back in, and reallocate budget with the sharper answer (see the calibration loop). The whole cycle runs in a modern web application with an AI analyst assistant alongside the Python library—take the platform tour.

The Business Case for Rigor

Organizations that adopt rigorous measurement practices don’t just get better models—they get better decisions. When you know which estimates are confident and which are uncertain, you can invest in experiments where they matter most, allocate budget based on validated effects, and build a compounding knowledge advantage over competitors who rely on specification-shopped results.

Calibration: Proof, Not Promises

A model, however careful, is still a recipe written on paper. Calibration is the taste test: the model proposes an answer, and a real-world experiment disposes. The most common form is a geo lift test—matched pairs of cities or regions where one side runs the media and the other holds out. The held-out regions live out the “what would have happened without it” question that no spreadsheet can answer on its own. The gap between the pairs is your incremental effect, measured in the real world.
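
The arithmetic behind a matched-pair readout is simple enough to sketch. The example below uses six hypothetical region pairs with made-up sales figures (in $k) and an assumed spend of $5k per treated region. It reports the average gap, a rough uncertainty range, and the implied incremental ROI. A production geo test would use a more careful design and analysis; this only illustrates the logic.

```python
import statistics

# Hypothetical data: (sales in treated region, sales in held-out region), in $k.
pairs = [
    (120.0, 110.0),
    (98.0, 95.0),
    (143.0, 131.0),
    (87.0, 88.0),
    (110.0, 101.0),
    (132.0, 124.0),
]
SPEND_PER_TREATED_REGION = 5.0  # assumed media spend per treated region, in $k

# The gap within each pair is that pair's estimate of incremental sales.
gaps = [treated - held_out for treated, held_out in pairs]

lift = statistics.mean(gaps)
std_error = statistics.stdev(gaps) / len(gaps) ** 0.5

# Crude range: lift plus or minus two standard errors (rough with only 6 pairs).
low, high = lift - 2 * std_error, lift + 2 * std_error
incremental_roi = lift / SPEND_PER_TREATED_REGION

print(f"incremental sales per region: {lift:.1f}k (roughly {low:.1f}k to {high:.1f}k)")
print(f"incremental ROI: {incremental_roi:.2f}")
```

Note that one pair shows the held-out region outselling the treated one. That is normal noise, and it is exactly the kind of result that should be kept in the readout, not dropped.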

What makes this a loop rather than a one-off audit is what happens next: the experiment’s readout is fed back into the model as evidence. Estimates that were wide tighten; estimates that were flattering get corrected. Each experiment moves through a tracked lifecycle—designed, pre-registered (so the success criteria are locked before launch, the same discipline that prevents specification shopping), run, read out, and finally folded into the next model fit. The platform manages this end to end, from choosing which test buys the most learning per dollar to scheduling the re-test when the answer goes stale.
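
One way to picture "fed back into the model as evidence" is a textbook Bayesian update with normal distributions. The sketch below is a simplification and not a description of how the platform performs calibration internally: it treats the model's TV ROI estimate as a normal prior and each experiment readout as a normal measurement. All numbers are hypothetical.

```python
def update(prior_mean, prior_sd, readout_mean, readout_se):
    """Combine a normal prior with a normal measurement (precision weighting)."""
    prior_precision = 1.0 / prior_sd ** 2
    readout_precision = 1.0 / readout_se ** 2
    post_precision = prior_precision + readout_precision
    post_mean = (prior_mean * prior_precision
                 + readout_mean * readout_precision) / post_precision
    return post_mean, post_precision ** -0.5


# Hypothetical starting point: a wide, flattering model estimate of TV ROI.
estimate = (1.9, 0.55)

# Hypothetical experiment readouts: (measured ROI, standard error).
experiments = [(1.2, 0.30), (1.0, 0.25)]

history = [estimate]
for readout_mean, readout_se in experiments:
    estimate = update(*estimate, readout_mean, readout_se)
    history.append(estimate)

for step, (mean, sd) in enumerate(history):
    print(f"after {step} experiment(s): ROI {mean:.2f}, "
          f"range {mean - 2 * sd:.2f} to {mean + 2 * sd:.2f}")
```

Each readout pulls the estimate toward what the experiment measured and narrows the range, with more precise experiments pulling harder.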

Watch Experiments Sharpen the Answer

A confounded model starts out confidently wrong about TV ROI. Each calibrated experiment pulls the estimate toward the truth and narrows the range you would bet a budget on.

Plausible TV ROI range

0.8× – 3.0×

Decision readiness

Too wide to bet a budget on

Illustrative posteriors. The framework’s measured version of this story—a confounded estimate corrected by calibration on a world with known truth—is in the end-to-end walkthrough.

⌛

Calibration Has a Shelf Life

Markets shift, creative wears out, competitors react—so a lift test from two years ago is a memory, not a measurement. The platform’s default decay assumptions—modeling priors you can change, not measured industry facts—treat digital learnings as fading with a half-life of about six months, and slower brand and TV learnings about a year. The measurement loop treats this like creative rotation: it tracks how stale each channel’s evidence is and schedules the re-test before decisions start leaning on expired knowledge.
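
A half-life translates into a simple weighting rule. The sketch below is one plausible reading of the defaults described above, not the platform's documented implementation: it halves the weight of an experiment's evidence for every half-life that has passed. The staleness threshold of 0.25 is a hypothetical value chosen for the example.

```python
import math

# Default half-lives described above, in months (configurable priors).
HALF_LIFE_MONTHS = {"digital": 6.0, "tv": 12.0}

STALE_BELOW = 0.25  # hypothetical threshold for scheduling a re-test


def evidence_weight(age_months, half_life_months):
    """Fraction of an experiment's original weight that remains."""
    return 0.5 ** (age_months / half_life_months)


def months_until_stale(half_life_months, threshold=STALE_BELOW):
    """Age at which the remaining weight drops to the threshold."""
    return half_life_months * math.log2(1.0 / threshold)


for channel, half_life in HALF_LIFE_MONTHS.items():
    weight_now = evidence_weight(24, half_life)
    retest_at = months_until_stale(half_life)
    print(f"{channel}: a 24-month-old test keeps {weight_now:.0%} of its weight; "
          f"re-test by month {retest_at:.0f}")
```

Under these assumptions, a two-year-old digital test carries about 6% of its original weight, while a TV test of the same age carries 25%.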

Questions to Ask Your Modeling Partner

Whether you are evaluating a new vendor or auditing existing work, these questions help distinguish rigorous measurement from specification shopping.

The gold standard is a pre-registered analysis plan that specifies model structure, priors, and decision criteria before the model is fit to data. Ask for the analysis plan and compare it to the final delivered model.

There is nothing wrong with testing multiple specifications, but all should be reported. If 30 models were run and only 1 is presented, the uncertainty is vastly understated. Ask for a sensitivity analysis showing how results change across reasonable specifications.
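
A sensitivity summary does not need to be elaborate. The sketch below takes a hypothetical set of TV ROI estimates, one per specification, and reports the full range alongside the share of specifications that clear break-even. The specification labels and values are invented for illustration.

```python
import statistics

# Hypothetical results: (adstock setting, control set) -> estimated TV ROI.
spec_roi = {
    ("decay=0.3", "all controls"): 0.62,
    ("decay=0.3", "no seasonality"): 1.35,
    ("decay=0.5", "all controls"): 0.71,
    ("decay=0.5", "no seasonality"): 1.48,
    ("decay=0.7", "all controls"): 0.85,
    ("decay=0.7", "no seasonality"): 1.61,
}

values = sorted(spec_roi.values())
lowest, highest = values[0], values[-1]
median_roi = statistics.median(values)
share_above_one = sum(v > 1.0 for v in values) / len(values)

print(f"specifications run: {len(values)}")
print(f"ROI range: {lowest:.2f} to {highest:.2f} (median {median_roi:.2f})")
print(f"share with ROI above 1.0: {share_above_one:.0%}")

# Show which modeling choice drives the split.
for (decay, controls), roi in sorted(spec_roi.items()):
    print(f"  {decay:<10} {controls:<15} {roi:.2f}")
```

In this made-up example, whether TV clears break-even depends entirely on whether seasonality is controlled for. A report that showed only the 1.61 figure would hide the one fact a budget owner most needs to know.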

The most powerful validation is comparing model-implied predictions against holdout experiments (geo lift tests, randomized controlled trials). If the modeling partner cannot point to any experimental validation, the model’s causal claims are untested.

If the answer is “we constrain it to be positive” or “we adjust the model,” this is specification shopping. Negative effects are valid findings that indicate over-saturation, poor creative, or confounding. Honest measurement sometimes delivers unwelcome news.

If the answer is “we report point estimates,” push for credible intervals. If intervals seem implausibly narrow, ask what assumptions produce that precision. Genuine Bayesian credible intervals for MMM are typically wide enough to affect optimization decisions.

Reproducibility is the minimum bar for scientific claims. If the modeling partner cannot provide code that reproduces their results, the work cannot be independently verified. The MMM Framework is fully open source and reproducible by design.

A strong answer names specific channels, specific tests, and dates. Calibration is perishable: this platform’s default working assumption (a configurable prior, not an industry statistic) is that digital learnings fade with a half-life of about six months and brand or TV learnings with a half-life of about a year, so “we ran a lift test once” is not a standing guarantee. Ask for the re-test schedule. A partner running a genuine measurement loop will have one; a partner who validated once and moved on is presenting a snapshot as a system.

Getting Started with Rigorous Measurement

Whether you are an agency, model shop, or brand, the transition to rigorous measurement follows a common path.

Assess your current practices

Review your existing modeling workflow against the detection criteria above. Identify where post hoc adjustments are made and where specification choices are made after seeing results rather than pre-specified.

Start with one project

Pick a single client or brand and run the full rigorous workflow alongside your existing approach. Compare the results and understand the differences.

Step-by-Step Modeling Guide →

Design validation experiments

Use model predictions to design geo lift tests or holdout experiments. This creates the feedback loop needed to distinguish working models from non-working ones—the heart of the measurement loop.

Design Experiments in the Platform →

Communicate uncertainty as a feature

Train stakeholders to see honest uncertainty ranges as more valuable than false precision. Saying “we are confident TV ROI is between 1.1 and 1.8” enables better decisions than saying “TV ROI is 1.42.”

Interpreting Results for Stakeholders →

Build organizational capability

Invest in training your team on Bayesian methods, causal inference, and the MMM Framework. This is a long-term competitive advantage, not just a tool change.

Key Takeaway

Marketing measurement is getting more rigorous, and clients are learning to tell the difference. The organizations that move first won’t just have better models—they’ll have client relationships built on validated results instead of confident presentation. The MMM Framework provides the tools—the rest is organizational commitment to honesty.

Ready to Learn More?

Explore the step-by-step modeling guide for implementing statistically sound models, or read interpreting results for guidance on communicating findings to media planners and CMOs.

The Hidden Risk in Your Marketing Models