Bayesian Workflow

A principled approach to building, checking, and refining Bayesian models. This guide follows the framework established by Gelman et al. (2020) and Gabry et al. (2019), adapted for Marketing Mix Models with practical PyMC implementations.

Key Reference

This workflow is based on Bayesian Workflow (Gelman et al., 2020) and Visualization in Bayesian Workflow (Gabry et al., 2019, JRSS-A). The core insight: Bayesian analysis is iterative—building, fitting, checking, and expanding models in a principled cycle.

Introduction

Bayesian data analysis is about more than computing a posterior distribution. It is an iterative process of model building, inference, model checking, and model expansion. Each step informs the next, and visualization is indispensable at every stage.

For Marketing Mix Models, this workflow is especially important because:

Identification challenges: MMMs often have weak identification, making prior choice critical
Domain knowledge is rich: We have strong beliefs about plausible effect sizes and shapes
Stakes are high: Budget allocation decisions depend on getting the model right
Overfitting is tempting: With many channels and controls, specification shopping is easy

Beneath the mechanics, the seven phases share one purpose: protecting the causal interpretation of the model's attribution. Prior predictive checks encode marketing knowledge before the data can speak; sensitivity analysis asks whether the attribution survives perturbation. A model that passes these gates has earned the right to say "this is what your media caused," not merely "this is what fit best."

The Full Workflow

flowchart TB subgraph Build["1. Model Building"] A1[Define Generative Model] A2[Elicit Prior Distributions] A3[Encode Domain Knowledge] end subgraph Prior["2. Prior Predictive"] B1[Sample from Priors] B2[Generate Fake Data] B3[Check Plausibility] end subgraph Fit["3. Computational Fitting"] C1[Run MCMC Sampler] C2[Check Diagnostics] C3[Address Pathologies] end subgraph Post["4. Posterior Predictive"] D1[Generate Replicated Data] D2[Compare to Observed] D3[Check Calibration] end subgraph Compare["5. Model Comparison"] E1[Compute LOO-CV] E2[Compare ELPD] E3[Stack or Select] end subgraph Sense["6. Sensitivity Analysis"] F1[Vary Priors] F2[Check Data Influence] F3[Assess Robustness] end Build --> Prior Prior -->|Priors implausible| Build Prior -->|Priors ok| Fit Fit -->|Diagnostics fail| Build Fit -->|Diagnostics ok| Post Post -->|Model fails| Iterate Post -->|Model ok| Compare Compare --> Sense Sense --> Iterate subgraph Iterate["7. Model Expansion"] G1[Add Complexity] G2[Refine Structure] end Iterate -->|New model| Build style Build fill:#f0f7e6,stroke:#6d8a4a style Prior fill:#e6f0f7,stroke:#4a6d8a style Fit fill:#f7f0e6,stroke:#8a6d4a style Post fill:#e6f7f0,stroke:#4a8a6d style Compare fill:#f0e6f7,stroke:#6d4a8a style Sense fill:#f7e6f0,stroke:#8a4a6d style Iterate fill:#f7f7e6,stroke:#8a8a4a

The workflow is not strictly linear. At each stage, we may discover problems that send us back to an earlier stage. This is normal and expected—the goal is to build understanding, not just to produce a single "correct" answer.

Philosophy

The Generative Model Perspective

A Bayesian model is a generative model—a complete description of how data could have been generated. This includes the prior \(p(\theta)\), the likelihood \(p(y|\theta)\), and together they define the joint distribution \(p(y, \theta) = p(y|\theta)p(\theta)\). If you can't simulate data from your model, you don't fully understand it.

The workflow emphasizes several key principles:

Principle

Prediction First

Check models by their predictions—both before seeing data (prior predictive) and after fitting (posterior predictive). If predictions don't make sense, the model doesn't make sense.

Principle

Iterate Honestly

Model building is iterative, but changes should be driven by scientific reasoning, not by "trying things until the coefficients look right." Document every change.

Principle

Embrace Uncertainty

Report full posterior distributions, not just point estimates. Uncertainty is information— wide posteriors tell you the data can't distinguish between hypotheses.

1. Model Building

Model building starts with a generative story: how do we believe the data were generated? For an MMM, this story includes the baseline sales process, the effect of marketing activities, and the noise structure.

Prior Elicitation

Priors encode domain knowledge before seeing the data. Think of them as your media plan's institutional memory—what decades of marketing science says about plausible effect sizes, carryover, and saturation before this particular dataset speaks. In MMM, we often have strong beliefs:

Parameter	Domain Knowledge	Prior Implication
Media effects \(\beta\)	Advertising generally has positive, diminishing returns	Positive support, shrinkage toward zero
Adstock \(\alpha\)	Effects decay over days/weeks, rarely months	Beta prior concentrated in [0.3, 0.8]
Saturation \(\lambda\)	Diminishing returns set in at realistic spend levels	Prior centered on typical campaign spend
Seasonality amplitude	Sales vary by season but not 10x	Constrained to plausible range

Encoding Domain Knowledge

The prior is the formal mechanism for encoding domain knowledge. Consider the saturation parameter \(\lambda\) in the framework's default saturation function:

$$f(x; \lambda) = 1 - e^{-\lambda x}$$ (saturation)

This is exponential saturation: \(f(0) = 0\), concave, asymptoting to 1. (The config name SaturationConfig.logistic() is historical — the shipped function is the curve above, not the logistic sigmoid; see transforms/saturation.py. Hill, Michaelis–Menten, and tanh shapes are available per channel via SaturationConfig.) If we know that diminishing returns are well established around 1M USD spend, we can set a prior on \(\lambda\) such that \(f(1M)\) has most of its mass in a plausible range (e.g., 60–90% of maximum effect).

Deep diveWorked example: calibrating the saturation prior to a target effect level

# Prior on saturation parameter
# If typical_spend = 1M and we want ~70% of max effect at typical spend:
# solve: 0.7 = 1 - exp(-lambda * 1M)
# lambda = -ln(0.3) / 1M ≈ 1.20 / 1M

import pymc as pm

with pm.Model():
    # Lognormal prior on lambda, centered on expected value
    lam = pm.Lognormal(
        "lambda",
        mu=np.log(1.20 / typical_spend),
        sigma=0.5  # allows factor of ~3x variation
    )

# Note: the shipped model applies saturation to adstocked spend normalized
# to roughly [0, 1] per channel, so its default prior sat_lam ~ Exponential(0.5)
# lives on that normalized scale, not on raw dollars.

2. Prior Predictive Checks

Before fitting the model to data, we sample from the prior predictive distribution:

$$p(y^{\text{rep}}) = \int p(y^{\text{rep}}|\theta) \, p(\theta) \, d\theta$$ (prior predictive)

This generates fake datasets from the model using only the priors—no observed data involved. The question is: could the observed data plausibly have come from this generative process?

Prior Pushforward

The prior pushforward is the distribution of derived quantities (predictions, effect sizes, ROI) implied by the prior. Even if individual parameter priors seem reasonable, their combination might imply absurd predictions. Checking the pushforward catches these problems.

Interactive: Prior Predictive Check

Adjust the prior parameters and see how they affect the prior predictive distribution of sales.

Media Effect σ: 0.5

Intercept σ: 0.3

Weakly Informative Priors

A prior is weakly informative if the prior predictive distribution:

Has mass on plausible datasets (including extreme but possible cases)
Has no mass on impossible datasets (negative sales, infinite values)
Is broader than the observed data (not cherry-picked to match)

Common Mistake: Flat Priors

"Non-informative" or flat priors on parameters often imply highly informative priors on predictions. For example, flat priors on many regression coefficients imply that extreme predictions are just as likely as moderate ones. Always check the prior pushforward.

# Prior predictive check in PyMC
with model:
    prior_pred = pm.sample_prior_predictive(samples=500)

# Visualize: do these look like plausible sales data?
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flat):
    ax.plot(prior_pred.prior_predictive["y_obs"][0, i])
    ax.set_title(f"Draw {i+1}")
plt.suptitle("Prior Predictive Samples")
plt.tight_layout()

3. Computational Fitting

Once satisfied with the prior predictive, we fit the model using MCMC (typically NUTS in PyMC). The goal is to obtain samples from the posterior:

$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} \propto p(y|\theta)p(\theta)$$ (Bayes' theorem)

MCMC Diagnostics

MCMC produces samples from the posterior, not the posterior itself. We must verify that the samples are reliable:

Diagnostic	Target	What it Checks
\(\hat{R}\) (R-hat)	< 1.01	Chains have converged to same distribution
ESS (bulk)	> 400	Enough independent samples for central tendency
ESS (tail)	> 400	Enough independent samples for tail quantiles
Divergences	0	Sampler exploring geometry correctly
Tree depth	< max	Sampler not struggling with curvature

Interactive: Trace Plot Diagnostics

Good chains look like "hairy caterpillars"—stationary, well-mixed, and indistinguishable.

Divergences and Pathologies

Divergences Indicate Geometry Problems

Divergent transitions mean the sampler encountered regions of high curvature it couldn't explore reliably. Common causes: non-centered parameterizations needed, funnel geometries, multimodality, or model misspecification. Never ignore divergences.

Solutions to common MCMC problems:

Problem

Divergences

Increase target_accept (0.9 → 0.95 → 0.99)
Use non-centered parameterization for hierarchical models
Reparameterize to reduce posterior curvature
Check for multimodality or unidentified parameters

Problem

Low ESS / High R-hat

Run more iterations
Check for slow-mixing parameters (treedepth warnings)
Improve initialization
Consider tighter priors if identification is weak

# Standard fitting with diagnostics
with model:
    trace = pm.sample(
        draws=2000,
        tune=2000,
        chains=4,
        target_accept=0.95,
        random_seed=42,
        return_inferencedata=True
    )

# Check diagnostics
print(az.summary(trace, var_names=["beta", "alpha", "lambda"]))
# Look for: R-hat < 1.01, ESS > 400, no divergences

4. Posterior Predictive Checks

After fitting, we generate data from the posterior predictive distribution:

$$p(y^{\text{rep}}|y) = \int p(y^{\text{rep}}|\theta) \, p(\theta|y) \, d\theta$$ (posterior predictive)

If the model is good, replicated data \(y^{\text{rep}}\) should look like the observed data \(y\). Systematic differences indicate model misspecification.

Interactive: Posterior Predictive Check

Compare observed data (dark line) to posterior predictive draws (light lines).

Calibration

A model is calibrated if its stated uncertainty is accurate. For example, 90% prediction intervals should contain the true value 90% of the time.

Probability Integral Transform (PIT)

If the posterior predictive is calibrated, the PIT values \(F_{y^{\text{rep}}}(y_{\text{obs}})\)—the CDF of the predictive distribution evaluated at the observed value—should be uniformly distributed. Deviations from uniformity indicate calibration problems.

Leave-One-Out Cross-Validation

LOO-CV estimates out-of-sample predictive performance without actually refitting the model for each held-out observation. PSIS-LOO uses importance sampling to approximate LOO efficiently:

# Compute LOO
loo = az.loo(trace, pointwise=True)
print(loo)

# Check Pareto k diagnostics
# k > 0.7: that observation is highly influential, LOO approximation unreliable
high_k = np.where(loo.pareto_k > 0.7)[0]
if len(high_k) > 0:
    print(f"Warning: {len(high_k)} observations with k > 0.7")

5. Model Comparison

When we have multiple candidate models, we compare them using their expected predictive performance. The key quantity is the expected log pointwise predictive density (ELPD):

$$\text{ELPD} = \sum_{i=1}^n \log p(y_i | y_{-i})$$ (leave-one-out ELPD)

ELPD and LOO-CV

Higher ELPD is better—it means the model makes better predictions on held-out data. The standard error of the ELPD difference tells us whether the difference is meaningful.

# Compare multiple models
comparison = az.compare({
    "standard": trace_standard,
    "nested": trace_nested,
    "hierarchical": trace_hierarchical
})
print(comparison)

# Visualize comparison
az.plot_compare(comparison)

Model Weights

Rather than selecting a single "best" model, consider stacking weights that combine models based on their predictive performance. This accounts for model uncertainty and often gives better predictions than any single model.

Model Topology

Gelman et al. (2020) recommend thinking of models as a topology—a network of related models that can be expanded or simplified. Each model comparison teaches us something about the structure of the problem. Below is the full model topology for the MMM Framework:

flowchart TB subgraph Core["Core Model Types"] STD[Standard MMM
Single outcome] NEST[Nested MMM
+ Mediators] MV[Multivariate MMM
+ Cross-effects] COMB[Combined MMM
Mediators + Cross-effects] end subgraph Hier["Hierarchical Extensions"] H_INT[+ Random Intercepts
Geo-level baseline] H_SLOPE[+ Random Slopes
Geo-level response] H_FULL[Full Hierarchical
Partial pooling] end subgraph VarSel["Variable Selection"] VS_HS[Regularized Horseshoe
Sparse precision controls] VS_SS[Spike-and-Slab
Discrete selection] VS_BL[Bayesian LASSO
L1 regularization] end STD --> NEST STD --> MV NEST --> COMB MV --> COMB STD --> H_INT NEST --> H_INT MV --> H_INT COMB --> H_INT H_INT --> H_SLOPE H_SLOPE --> H_FULL H_FULL -.-> VS_HS H_FULL -.-> VS_SS H_FULL -.-> VS_BL style STD fill:#f0f7e6,stroke:#6d8a4a style NEST fill:#e6f0f7,stroke:#4a6d8a style MV fill:#f7f0e6,stroke:#8a6d4a style COMB fill:#f0e6f7,stroke:#6d4a8a style H_FULL fill:#e6f7f0,stroke:#4a8a6d style VS_HS fill:#f7e6e6,stroke:#8a4a4a

Model Selection Guide

Scenario	Model Choice	Key Considerations
Single outcome, national data	Standard MMM	Start here; acknowledge wide uncertainty
Upper-funnel metrics available	Nested MMM	Awareness/consideration as mediators; direct/indirect decomposition
Product portfolio effects	Multivariate MMM	Cannibalization, halo effects; correlated errors via LKJ
Full funnel + portfolio	Combined MMM	Highest complexity; requires strong identification
Multi-geo, national media	+ Random Intercepts	Geo baselines; not geo-level media response
Multi-geo, regional media	+ Random Slopes	Geo-level variation in media creates identification
Many precision controls	+ Variable Selection	Horseshoe/spike-slab on controls only; never on media

Variable Selection Caution

Variable selection priors (horseshoe, spike-and-slab) should only be applied to precision control variables—not to confounders, mediators, or media channels. Applying selection priors to causal variables can severely bias effect estimates. See the Variable Selection Guide for details.

6. Sensitivity Analysis

How much do our conclusions depend on our modeling choices? Sensitivity analysis systematically varies assumptions to assess robustness. For an MMM, this is the causal stress test: if a channel's attribution survives reasonable perturbations of priors, trend, and influential weeks, you can treat it as a finding rather than an artifact of one specification.

Prior Sensitivity

Refit the model with different priors and compare posteriors. If conclusions change dramatically with reasonable prior variations, the data are not strongly informative about those parameters.

Interactive: Prior Sensitivity

See how the posterior changes as the prior becomes more or less informative.

Prior SD: 0.5

Data Influence

The Pareto k diagnostics from LOO-CV identify influential observations. High-influence observations warrant investigation:

Are they outliers? Consider robust likelihoods (Student-t)
Are they unusual periods? (Promotions, COVID, competitive activity)
Are they data errors? Check the raw data

7. Model Expansion

Based on posterior predictive checks and domain understanding, we iteratively expand the model to address deficiencies. The MMM Framework provides structured expansion paths:

Standard → Nested

Add Mediators

When upper-funnel metrics (awareness, consideration) are available, add them as mediators to decompose effects into direct and indirect pathways.

Fully observed: complete time series
Partially observed: sparse survey data
Fully latent: requires strong priors

Standard → Multivariate

Add Cross-Effects

When modeling product portfolios, add cross-effects to capture cannibalization or halo effects between outcomes.

Cannibalization: negative cross-effect
Halo: positive cross-effect
LKJ prior on error correlations

Any → Hierarchical

Add Partial Pooling

With multi-geo data, add hierarchical structure to borrow strength across regions while capturing heterogeneity.

Random intercepts: geo baselines
Random slopes: requires geo media variation
Non-centered for divergence control

Expansion

Improve Time Structure

If residuals are autocorrelated, add more flexible trend/seasonality:

Piecewise linear trend with changepoints
Gaussian Process trend for smooth variation
Fourier seasonality with more harmonics
Holiday/event effects

Expansion

Refine Transformations

If saturation/carryover curves don't match patterns:

Logistic vs. Hill saturation functions
Geometric vs. Weibull adstock decay
Channel-specific vs. shared parameters
Time-varying saturation for mature brands

Expansion Decision Tree

flowchart TD A[Posterior Predictive Check] --> B{Systematic
Residual Pattern?} B -->|No| C[Model OK] B -->|Yes| D{What type?} D -->|Autocorrelation| E[Add Trend/AR] D -->|Geo heterogeneity| F[Add Hierarchical] D -->|Cross-outcome| G[Add Multivariate] D -->|Mediation signal| H[Add Nested] E --> I{Improved ELPD?} F --> I G --> I H --> I I -->|Yes, > 1 SE| J[Accept Expansion] I -->|No| K[Reject / Investigate] J --> A K --> L[Check Priors/
Data Quality] style C fill:#e6f7e6,stroke:#4a8a4a style J fill:#e6f0f7,stroke:#4a6d8a style K fill:#f7e6e6,stroke:#8a4a4a

When to Stop

Model expansion has diminishing returns. Stop when:

Posterior predictive checks show no systematic problems
Additional complexity doesn't improve ELPD beyond standard error
The model answers the scientific questions with adequate precision
You've started "shopping" for specifications that give desired answers

Avoid Specification Shopping

If you try many models and only report the one with "reasonable" coefficients, you're specification shopping. This invalidates uncertainty quantification and leads to overconfident conclusions. Document all models tried and all decisions made.

MMM-Specific Workflow

Applying the Bayesian workflow to Marketing Mix Models involves specific considerations for the MMM Framework's model classes:

Workflow by Model Type

Standard → Nested Workflow

Fit Standard MMM first — Establish baseline effects and diagnostics
Check mediator data — Is it fully observed, sparse, or absent?
Configure mediator type — FULLY_OBSERVED, PARTIALLY_OBSERVED, or FULLY_LATENT
Map channels to mediators — Not all channels need to affect all mediators
Compare ELPD — Does the nested model improve predictions?
Interpret decomposition — Direct vs. indirect effects, proportion mediated

Standard → Multivariate Workflow

Fit separate Standard MMMs — One per outcome, check individual diagnostics
Check residual correlations — Are outcome residuals correlated?
Configure outcomes — OutcomeConfig for each with separate/shared transformations
Add cross-effects — CANNIBALIZATION, HALO, or UNCONSTRAINED (the CrossEffectType members; also SYMMETRIC / ASYMMETRIC)
Set LKJ prior — η=1 (uniform), η=2 (shrink toward independence)
Interpret cross-effects — Effect of outcome A on outcome B

MMM Prior Guidance

For each major parameter class in the framework, here is guidance on setting informative priors:

Deep diveShipped default priors by parameter class

Parameter	Shipped Default	Rationale
Media coefficient β (standardized scale)	`Gamma(μ=1.5, σ=1.0)`	Positive effects with mass spread around 1 on the standardized max-contribution scale; overridden per channel when an experiment-calibrated `roi_prior` is set
Adstock — parametric (the default since June 2026)	`α ~ Beta(1, 3)` (geometric)	Continuous in-graph decay, prior mean 0.25 (short memory); delayed/Weibull kernels have their own priors per `MediaChannelConfig.adstock`. Made the default after pressure testing measured the legacy blend at ~28% error vs ~7% parametric on carryover-sensitive worlds
Adstock — legacy blend (opt-out)	`adstock_mix ~ Beta(2, 2)`	A learned blend weight between two fixed geometric decays (bank α ∈ {0.0, 0.3, 0.5, 0.7, 0.9}); the pre-change default, available via `use_parametric_adstock=False` for reproducing older fits
Saturation λ (exponential-saturation, config name "logistic")	`Exponential(λ=0.5)`	On adstocked spend normalized to ≈[0, 1]; prior mean 2 puts ~86% of max effect at the historical spend peak
Mediator→Outcome γ (NestedMMM)	`HalfNormal(σ=1)`	Positive awareness→sales; scale depends on units
Cross-effect (cannib., MultivariateMMM)	`−HalfNormal(σ=0.3)`	Constrained negative; typically small relative to main effects
Hierarchical σ (geo / product offsets)	`HalfNormal(σ=0.3)`	Prior on cross-sectional variation; weakly informative
LKJ η (correlations, MultivariateMMM)	`η=2.0`	Mild shrinkage toward independence; η=1 is uniform
Horseshoe τ (global, control selection)	`D₀/(D-D₀) · σ/√n`	Calibrated by expected # nonzero (Piironen & Vehtari, 2017)

Defaults above are read from the shipped code (model/base.py, config/transforms.py) as of 0.2.0 — if this table and the code ever disagree, the code wins and this page is stale.

# Example: configure priors per channel with the shipped config objects
from mmm_framework.config import (
    AdstockConfig, MediaChannelConfig, PriorConfig, SaturationConfig,
)

tv = MediaChannelConfig(
    name="TV",
    adstock=AdstockConfig.geometric(          # parametric path (the default;
        l_max=8,                              # ModelConfig.use_parametric_adstock=True)
        alpha_prior=PriorConfig.beta(alpha=2, beta=2),  # longer-memory belief for TV
    ),
    saturation=SaturationConfig.logistic(     # exponential-saturation form 1 - e^(-λx)
        lam_prior=PriorConfig.gamma(alpha=2.0, beta=1.0),
    ),
)

Validation Strategies

MMM validation goes beyond statistical checks. The framework supports multiple validation approaches:

Internal

Holdout Validation

Hold out recent periods and assess prediction accuracy using the framework's predict() method.

MAPE, RMSE on holdout
Coverage of prediction intervals
⚠️ Tests extrapolation, not causality

External

Geo Experiments

Compare MMM estimates to randomized geo experiments—the gold standard for causal validation.

Match lift estimates to geo tests
Assess coverage of credible intervals
Calibrate priors if systematic bias

Business

Plausibility Checks

Verify results against domain knowledge and industry benchmarks.

ROI within industry norms
Adstock aligns with creative wear-out
Saturation at plausible spend levels

Framework Validation Code

# Holdout validation: rolling-origin backtest (refits on expanding
# training windows, grades genuine out-of-time forecasts)
from mmm_framework.validation import BacktestConfig, run_backtest

result = run_backtest(
    BayesianMMM(panel, config),
    BacktestConfig(min_train_size=104, horizon=13, step=13),
)
print(result.summary())    # MAPE/MASE vs naive + seasonal-naive baselines
print(result.coverage())   # nominal vs empirical interval coverage (~90% at 0.9)

# Compare to a geo experiment: the model's implied lift for the channel
contrib = model.compute_counterfactual_contributions(
    compute_uncertainty=True, hdi_prob=0.90
)
mmm_lift = contrib.total_contributions["TV"] / baseline_sales
print(f"MMM lift: {mmm_lift:.1%} "
      f"[{contrib.contribution_hdi_low['TV'] / baseline_sales:.1%}, "
      f"{contrib.contribution_hdi_high['TV'] / baseline_sales:.1%}]")
print(f"Geo test lift: {0.12:.1%}")  # from the experiment readout
# If the geo result falls outside the MMM interval, calibrate:
# feed the experiment in as a likelihood term via add_experiment_calibration()
# or compare systematically with ModelValidator + ValidationConfigBuilder
# .with_calibration([LiftTestResult(...)]).

Documentation is Critical

Maintain a log of all modeling decisions: priors chosen, diagnostics observed, models compared, and rationale for expansions. The framework's model.save() method preserves configs, but document the why behind each choice separately. This prevents specification shopping and enables honest reporting.

References

Gelman, A., Vehtari, A., Simpson, D., et al. (2020). Bayesian Workflow. arXiv:2011.01808.
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., & Gelman, A. (2019). Visualization in Bayesian workflow. J. R. Stat. Soc. A, 182(2), 389-402.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413-1432.
Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.