Scientific Statistical Modeling

A principled approach to building, evaluating, and iterating on statistical models. This guide establishes the philosophical foundations that underpin rigorous quantitative analysis— applicable to Marketing Mix Models and beyond.

Core Insight

Scientific modeling is not about finding "the correct model." It is about building useful representations of reality, understanding their limitations, and honestly communicating what they can and cannot tell us about the questions we care about.

🔬 See It In Action: Interactive Workflow Demo

Walk through the complete scientific modeling workflow step-by-step. Build a Bayesian MMM from question formulation through prior predictive checks, MCMC fitting, diagnostics, and sensitivity analysis—culminating in an example report.

Launch Interactive Demo →

What is a Statistical Model?

Definition

A statistical model is a mathematical description of a data-generating process. It specifies a family of probability distributions indexed by parameters, where different parameter values correspond to different hypotheses about how the data arose.

Models are not photographs of reality—they are useful simplifications. A map of New York City is not the city itself, but it can help you navigate. Similarly, a model of sales response to advertising is not the actual cognitive and behavioral processes involved, but it can help you make better decisions.

This perspective is liberating: we don't need to find the "true" model (which doesn't exist). We need to find models that are useful for our purposes while being honest about their limitations.

All Models Are Wrong

"All models are wrong, but some are useful." — George E.P. Box

Box's aphorism captures a fundamental truth. Every model makes simplifying assumptions that do not perfectly match reality. The question is not "is this model true?" but rather:

A model that captures the main drivers of sales may be highly useful even if it ignores subtle effects that contribute only 2% of variance. The key is knowing which simplifications matter for your question and which don't.

The Generative Perspective

Generative Model

A model is generative if it can be used to simulate new data. Given parameter values, you can run the model forward to produce synthetic observations. This is the key test of understanding: if you cannot simulate data from your model, you do not fully understand it.

The generative perspective offers several advantages:

Advantage Description
Simulation You can generate fake data before seeing real data, checking whether your model can produce plausible outcomes
Prior predictive checks Sample from priors, push through the model, and examine implied predictions—catching implausible assumptions early
Posterior predictive checks After fitting, simulate new data from the posterior and compare to observations—assessing model adequacy
Calibration Test whether predictive intervals have nominal coverage on held-out data
Transparent assumptions Every assumption is explicit in the generative story—nothing hidden in "black box" algorithms

🔬 Interactive: Prior Predictive Simulation

See how prior assumptions translate into predicted data ranges. Adjust the prior parameters and observe the implied distribution of sales predictions.

0.3
0.2

Each line represents one possible data trajectory given the prior assumptions. If the gray band covers implausible values (e.g., negative sales), your priors need adjustment.

The Generative Process

flowchart LR subgraph Prior["Prior Beliefs"] P[/"θ ~ p(θ)"/] end subgraph Model["Generative Model"] L["y ~ p(y|θ)"] end subgraph Output["Predictions"] D[/"Data y"/] end P --> Model Model --> D style Prior fill:#f0f7e6,stroke:#6d8a4a style Model fill:#e6f0f7,stroke:#4a6d8a style Output fill:#f7f0e6,stroke:#8a6d4a

Questions Drive Models

Scientific modeling begins with a question, not with data or techniques. The question determines what model structure is appropriate, what data are needed, and how to evaluate success.

Question Type

Descriptive

"What patterns exist in the data?" Summarize, visualize, identify regularities. Less concerned with causation.

Question Type

Predictive

"What will happen next?" Forecast future values. Model complexity traded against generalization.

Question Type

Causal

"What would happen if we intervened?" Requires identifying assumptions and often experimental validation.

Marketing Mix Models typically aim to answer causal questions: "What is the effect of advertising on sales?" This is the hardest type of question to answer from observational data, and requires explicit assumptions about the data-generating process.

The Generative Story

Every model should have a generative story: a narrative description of how you believe the data came to be. For an MMM, this might be:

Example: MMM Generative Story

  1. There is a baseline level of sales that would occur without any marketing.
  2. This baseline varies over time due to trend and seasonality.
  3. Marketing activities (TV, digital, etc.) increase sales above baseline.
  4. Each channel has carryover effects (adstock) and diminishing returns (saturation).
  5. External factors (weather, economy, competitors) create additional variation.
  6. The observed sales = baseline + marketing effects + external effects + noise.

Writing down this story forces you to be explicit about your assumptions. Each element of the story corresponds to a component of the mathematical model.

Model Components

Component

Likelihood

The probability of observed data given parameters. How is noise distributed? Normal? Heavy-tailed? Count data?

Component

Structural Model

Functional relationships between variables. Linear? Nonlinear? Captures the "systematic" part of the data-generating process.

Component

Priors

Probability distributions over unknown parameters encoding beliefs before seeing data. Essential for Bayesian inference.

Choosing a Likelihood

The likelihood should match the nature of the outcome:

Outcome Type Common Likelihood Key Property
Continuous, unbounded Normal (Gaussian) Symmetric, thin tails
Continuous, positive Log-normal, Gamma Right-skewed, positive support
Count data Poisson, Negative binomial Discrete, non-negative
Binary Bernoulli (via logit/probit) 0/1 outcomes
Proportions (0-1) Beta Bounded continuous

Iteration as Learning

Model building is inherently iterative. We build, check, revise, and expand—learning about both the data and the phenomenon through this cycle. This is not a failure of method; it is the method.

The Scientific Modeling Cycle

flowchart TB Q[Define Question] --> S[Generative Story] S --> B[Build Model] B --> PP[Prior Predictive Check] PP -->|Implausible| S PP -->|Reasonable| F[Fit Model] F --> D[Diagnostics] D -->|Failures| B D -->|OK| PPC[Posterior Predictive Check] PPC -->|Model fails| E[Expand/Revise] PPC -->|Model adequate| R[Report Results] E --> B style Q fill:#f0f7e6,stroke:#6d8a4a style S fill:#e6f0f7,stroke:#4a6d8a style B fill:#f7f0e6,stroke:#8a6d4a style PP fill:#e6f7f0,stroke:#4a8a6d style F fill:#f0e6f7,stroke:#6d4a8a style D fill:#f7e6f0,stroke:#8a4a6d style PPC fill:#e6f0f7,stroke:#4a6d8a style E fill:#f7f0e6,stroke:#8a6d4a style R fill:#f0f7e6,stroke:#6d8a4a

Each iteration teaches us something:

Honest Iteration vs. Specification Shopping

There is a crucial distinction between legitimate scientific iteration and problematic "specification shopping." Both involve changing models, but they differ fundamentally in what drives the changes.

✓ Honest Scientific Iteration

Changes driven by:

  • Diagnostic failures (divergences, poor mixing)
  • Posterior predictive mismatches
  • Domain knowledge about missing structure
  • Pre-specified model expansion criteria

The goal is model improvement based on evidence of inadequacy.

✗ Specification Shopping

Changes driven by:

  • Coefficient has "wrong" sign
  • Effect not statistically significant
  • Results don't match expectations
  • ROI falls below threshold

The goal is obtaining desired results, not model improvement.

⚠️ Interactive: The Multiple Testing Problem

When you test multiple specifications and report the one with the best results, your false positive rate explodes. See how quickly the probability of finding "significant" results by chance increases with the number of specifications tested.

20
0.05
Effective False Positive Rate: 64.2%
Confidence Interval Validity: Severely Compromised

The Winner's Curse

Even when a true effect exists, selecting the specification with the highest t-statistic biases estimates upward. The "winning" specification likely benefited from favorable noise in that particular sample. This is why specification-shopped effects often fail to replicate: you selected on noise, not signal.

📊 Interactive: The Winner's Curse

Each specification produces a noisy estimate of the true effect. When you pick the "best" one, you systematically overestimate. The more specifications you test, the more biased your selected estimate.

0.20
0.10
20
Expected bias in selected estimate: +45%

Pre-Specification: The Solution

The solution to specification shopping is pre-specification: commit to your model structure before looking at results. This doesn't prevent iteration—you can still revise models—but it requires documenting and justifying changes, making the distinction between planned and exploratory analyses explicit.

Expanding Models

When a model fails predictive checks or sensitivity analysis reveals fragility, we may need to expand it. Model expansion should be driven by the nature of the failure, not by a desire for different results.

Observed Problem Possible Expansion
Residual autocorrelation Add lagged terms, AR errors, or state-space dynamics
Non-constant variance Model heteroskedasticity, use robust likelihood
Outliers in posterior predictive Use heavy-tailed distribution, mixture model
Group-level variation Hierarchical/multilevel structure
Non-linear patterns Flexible basis functions, splines, saturation curves

Predictive Checking

The primary tool for evaluating models is predictive checking: comparing model predictions to observed data. There are two forms:

Before Fitting

Prior Predictive

Generate data using only priors. Do the implied predictions look plausible? Could the observed data have come from this prior predictive distribution?

After Fitting

Posterior Predictive

Generate replicated data from the posterior. Compare statistics of replicated data to observed data. Where does the model fail to reproduce patterns?

Prior Predictive Checks

Before fitting to data, sample from priors and push through the model:

# Prior predictive check
with model:
    prior_pred = pm.sample_prior_predictive(samples=500)

# Examine the implied distribution of predictions
y_prior = prior_pred.prior_predictive["y"].values

# Check: are these values plausible for sales data?
# - All positive? (sales can't be negative)
# - Reasonable range? (not implying billion-dollar weeks)
# - Sensible variation? (not identical across all scenarios)

Posterior Predictive Checks

After fitting, generate replicated datasets and compare to observations:

# Posterior predictive check
with model:
    post_pred = pm.sample_posterior_predictive(trace)

# Compare observed and replicated data
y_rep = post_pred.posterior_predictive["y"].values
y_obs = data["sales"].values

# Visual checks
# - Does distribution of y_rep match distribution of y_obs?
# - Do time-series patterns match?
# - Are extreme values reproduced?

Sensitivity Analysis

Sensitivity analysis asks: how much do conclusions change if we change assumptions? Conclusions that are robust to reasonable alternative assumptions are more credible than those that depend on specific choices.

🔄 Interactive: Sensitivity Across Specifications

The same data analyzed with different reasonable specifications can yield vastly different conclusions. This visualization shows the distribution of coefficient estimates across a range of plausible model choices.

20
0.25

Each dot represents one specification's estimate. The red dashed line shows the "selected" estimate if you picked the most significant result. The gray band shows the honest uncertainty range.

What to Vary

External Validation

The strongest test of a model is its performance on data it has never seen—especially data generated by experimental intervention. For MMMs, this typically means comparing model predictions to results from geo-lift experiments or randomized holdout tests.

Why Experimental Validation Matters

Observational models can fit historical data well while producing biased causal estimates. Experimental validation provides ground truth against which to calibrate model predictions. Industry studies suggest two-thirds of uncalibrated MMM estimates require significant adjustment when compared to experimental results.

Starting Simple

A common mistake is to start with a complex model. Better practice:

A simple model that you understand deeply is more valuable than a complex model that produces mysteriously "better" results.

When to Stop

Model building could continue forever. Practical stopping rules:

Importantly, stopping is not the same as claiming the model is "correct." It means the model is adequate for current purposes, given current constraints.

Communicating Models

How you communicate model results is as important as the modeling itself. Key principles:

Uncertainty Is Information

Wide credible intervals are not a failure of analysis—they are honest communication that the data cannot distinguish between hypotheses. Wide credible intervals tell you "we need more data or experimentation to decide this confidently." That information is valuable: it prevents overconfident decisions and identifies where to invest in learning.

Summary: The Scientific Modeling Mindset

🎯

Question First

Start with the decision, not the technique. Let the question drive model structure.

📖

Tell the Story

Every model should have a generative story. If you can't simulate data, you don't understand the model.

🔄

Iterate Honestly

Model building is iterative—but changes should be driven by diagnostics, not desired results.

🔍

Check Everything

Prior predictive, posterior predictive, diagnostics, sensitivity. Trust but verify.

📊

Quantify Uncertainty

All models are wrong. Honest uncertainty quantification tells us how wrong they might be.

🗣️

Communicate Clearly

Report what you know, what you don't, and what would change your conclusions.

Further Reading