Bayesian Workflow
A principled approach to building, checking, and refining Bayesian models. This guide follows the framework established by Gelman et al. (2020) and Gabry et al. (2019), adapted for Marketing Mix Models with practical PyMC implementations.
Key Reference
This workflow is based on Bayesian Workflow (Gelman et al., 2020) and Visualization in Bayesian Workflow (Gabry et al., 2019, JRSS-A). The core insight: Bayesian analysis is iterative—building, fitting, checking, and expanding models in a principled cycle.
Introduction
Bayesian data analysis is about more than computing a posterior distribution. It is an iterative process of model building, inference, model checking, and model expansion. Each step informs the next, and visualization is indispensable at every stage.
For Marketing Mix Models, this workflow is especially important because:
- Identification challenges: MMMs often have weak identification, making prior choice critical
- Domain knowledge is rich: We have strong beliefs about plausible effect sizes and shapes
- Stakes are high: Budget allocation decisions depend on getting the model right
- Overfitting is tempting: With many channels and controls, specification shopping is easy
The Full Workflow
The workflow is not strictly linear. At each stage, we may discover problems that send us back to an earlier stage. This is normal and expected—the goal is to build understanding, not just to produce a single "correct" answer.
Philosophy
The Generative Model Perspective
A Bayesian model is a generative model—a complete description of how data could have been generated. This includes the prior \(p(\theta)\), the likelihood \(p(y|\theta)\), and together they define the joint distribution \(p(y, \theta) = p(y|\theta)p(\theta)\). If you can't simulate data from your model, you don't fully understand it.
The workflow emphasizes several key principles:
Prediction First
Check models by their predictions—both before seeing data (prior predictive) and after fitting (posterior predictive). If predictions don't make sense, the model doesn't make sense.
Iterate Honestly
Model building is iterative, but changes should be driven by scientific reasoning, not by "trying things until the coefficients look right." Document every change.
Embrace Uncertainty
Report full posterior distributions, not just point estimates. Uncertainty is information— wide posteriors tell you the data can't distinguish between hypotheses.
1. Model Building
Model building starts with a generative story: how do we believe the data were generated? For an MMM, this story includes the baseline sales process, the effect of marketing activities, and the noise structure.
Prior Elicitation
Priors encode domain knowledge before seeing the data. In MMM, we often have strong beliefs:
| Parameter | Domain Knowledge | Prior Implication |
|---|---|---|
| Media effects \(\beta\) | Advertising generally has positive, diminishing returns | Positive support, shrinkage toward zero |
| Adstock \(\alpha\) | Effects decay over days/weeks, rarely months | Beta prior concentrated in [0.3, 0.8] |
| Saturation \(\lambda\) | Diminishing returns set in at realistic spend levels | Prior centered on typical campaign spend |
| Seasonality amplitude | Sales vary by season but not 10x | Constrained to plausible range |
Encoding Domain Knowledge
The prior is the formal mechanism for encoding domain knowledge. Consider the saturation parameter \(\lambda\) in a logistic saturation function:
If we know that saturation typically occurs around 1M USD spend, we can set a prior on \(\lambda\) such that \(f(1M)\) has most of its mass in a plausible range (e.g., 60-90% of maximum effect).
# Prior on saturation parameter
# If typical_spend = 1M and we want 70% saturation at typical spend:
# solve: 0.7 = 1/(1 + exp(-lambda * 1M))
# lambda ≈ 0.85 / 1M
import pymc as pm
with pm.Model():
# Lognormal prior on lambda, centered on expected value
lam = pm.Lognormal(
"lambda",
mu=np.log(0.85 / typical_spend),
sigma=0.5 # allows factor of ~3x variation
)
2. Prior Predictive Checks
Before fitting the model to data, we sample from the prior predictive distribution:
This generates fake datasets from the model using only the priors—no observed data involved. The question is: could the observed data plausibly have come from this generative process?
Prior Pushforward
Prior Pushforward
The prior pushforward is the distribution of derived quantities (predictions, effect sizes, ROI) implied by the prior. Even if individual parameter priors seem reasonable, their combination might imply absurd predictions. Checking the pushforward catches these problems.
Interactive: Prior Predictive Check
Adjust the prior parameters and see how they affect the prior predictive distribution of sales.
Weakly Informative Priors
A prior is weakly informative if the prior predictive distribution:
- Has mass on plausible datasets (including extreme but possible cases)
- Has no mass on impossible datasets (negative sales, infinite values)
- Is broader than the observed data (not cherry-picked to match)
Common Mistake: Flat Priors
"Non-informative" or flat priors on parameters often imply highly informative priors on predictions. For example, flat priors on many regression coefficients imply that extreme predictions are just as likely as moderate ones. Always check the prior pushforward.
# Prior predictive check in PyMC
with model:
prior_pred = pm.sample_prior_predictive(samples=500)
# Visualize: do these look like plausible sales data?
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flat):
ax.plot(prior_pred.prior_predictive["y_obs"][0, i])
ax.set_title(f"Draw {i+1}")
plt.suptitle("Prior Predictive Samples")
plt.tight_layout()
3. Computational Fitting
Once satisfied with the prior predictive, we fit the model using MCMC (typically NUTS in PyMC). The goal is to obtain samples from the posterior:
MCMC Diagnostics
MCMC produces samples from the posterior, not the posterior itself. We must verify that the samples are reliable:
| Diagnostic | Target | What it Checks |
|---|---|---|
| \(\hat{R}\) (R-hat) | < 1.01 | Chains have converged to same distribution |
| ESS (bulk) | > 400 | Enough independent samples for central tendency |
| ESS (tail) | > 400 | Enough independent samples for tail quantiles |
| Divergences | 0 | Sampler exploring geometry correctly |
| Tree depth | < max | Sampler not struggling with curvature |
Interactive: Trace Plot Diagnostics
Good chains look like "hairy caterpillars"—stationary, well-mixed, and indistinguishable.
Divergences and Pathologies
Divergences Indicate Geometry Problems
Divergent transitions mean the sampler encountered regions of high curvature it couldn't explore reliably. Common causes: non-centered parameterizations needed, funnel geometries, multimodality, or model misspecification. Never ignore divergences.
Solutions to common MCMC problems:
Divergences
- Increase
target_accept(0.9 → 0.95 → 0.99) - Use non-centered parameterization for hierarchical models
- Reparameterize to reduce posterior curvature
- Check for multimodality or unidentified parameters
Low ESS / High R-hat
- Run more iterations
- Check for slow-mixing parameters (treedepth warnings)
- Improve initialization
- Consider tighter priors if identification is weak
# Standard fitting with diagnostics
with model:
trace = pm.sample(
draws=2000,
tune=2000,
chains=4,
target_accept=0.95,
random_seed=42,
return_inferencedata=True
)
# Check diagnostics
print(az.summary(trace, var_names=["beta", "alpha", "lambda"]))
# Look for: R-hat < 1.01, ESS > 400, no divergences
4. Posterior Predictive Checks
After fitting, we generate data from the posterior predictive distribution:
If the model is good, replicated data \(y^{\text{rep}}\) should look like the observed data \(y\). Systematic differences indicate model misspecification.
Interactive: Posterior Predictive Check
Compare observed data (dark line) to posterior predictive draws (light lines).
Calibration
A model is calibrated if its stated uncertainty is accurate. For example, 90% prediction intervals should contain the true value 90% of the time.
Probability Integral Transform (PIT)
If the posterior predictive is calibrated, the PIT values \(F_{y^{\text{rep}}}(y_{\text{obs}})\)—the CDF of the predictive distribution evaluated at the observed value—should be uniformly distributed. Deviations from uniformity indicate calibration problems.
Leave-One-Out Cross-Validation
LOO-CV estimates out-of-sample predictive performance without actually refitting the model for each held-out observation. PSIS-LOO uses importance sampling to approximate LOO efficiently:
# Compute LOO
loo = az.loo(trace, pointwise=True)
print(loo)
# Check Pareto k diagnostics
# k > 0.7: that observation is highly influential, LOO approximation unreliable
high_k = np.where(loo.pareto_k > 0.7)[0]
if len(high_k) > 0:
print(f"Warning: {len(high_k)} observations with k > 0.7")
5. Model Comparison
When we have multiple candidate models, we compare them using their expected predictive performance. The key quantity is the expected log pointwise predictive density (ELPD):
ELPD and LOO-CV
Higher ELPD is better—it means the model makes better predictions on held-out data. The standard error of the ELPD difference tells us whether the difference is meaningful.
# Compare multiple models
comparison = az.compare({
"standard": trace_standard,
"nested": trace_nested,
"hierarchical": trace_hierarchical
})
print(comparison)
# Visualize comparison
az.plot_compare(comparison)
Model Weights
Rather than selecting a single "best" model, consider stacking weights that combine models based on their predictive performance. This accounts for model uncertainty and often gives better predictions than any single model.
Model Topology
Gelman et al. (2020) recommend thinking of models as a topology—a network of related models that can be expanded or simplified. Each model comparison teaches us something about the structure of the problem. Below is the full model topology for the MMM Framework:
Single outcome] NEST[Nested MMM
+ Mediators] MV[Multivariate MMM
+ Cross-effects] COMB[Combined MMM
Mediators + Cross-effects] end subgraph Hier["Hierarchical Extensions"] H_INT[+ Random Intercepts
Geo-level baseline] H_SLOPE[+ Random Slopes
Geo-level response] H_FULL[Full Hierarchical
Partial pooling] end subgraph VarSel["Variable Selection"] VS_HS[Regularized Horseshoe
Sparse precision controls] VS_SS[Spike-and-Slab
Discrete selection] VS_BL[Bayesian LASSO
L1 regularization] end STD --> NEST STD --> MV NEST --> COMB MV --> COMB STD --> H_INT NEST --> H_INT MV --> H_INT COMB --> H_INT H_INT --> H_SLOPE H_SLOPE --> H_FULL H_FULL -.-> VS_HS H_FULL -.-> VS_SS H_FULL -.-> VS_BL style STD fill:#f0f7e6,stroke:#6d8a4a style NEST fill:#e6f0f7,stroke:#4a6d8a style MV fill:#f7f0e6,stroke:#8a6d4a style COMB fill:#f0e6f7,stroke:#6d4a8a style H_FULL fill:#e6f7f0,stroke:#4a8a6d style VS_HS fill:#f7e6e6,stroke:#8a4a4a
Model Selection Guide
| Scenario | Model Choice | Key Considerations |
|---|---|---|
| Single outcome, national data | Standard MMM | Start here; acknowledge wide uncertainty |
| Upper-funnel metrics available | Nested MMM | Awareness/consideration as mediators; direct/indirect decomposition |
| Product portfolio effects | Multivariate MMM | Cannibalization, halo effects; correlated errors via LKJ |
| Full funnel + portfolio | Combined MMM | Highest complexity; requires strong identification |
| Multi-geo, national media | + Random Intercepts | Geo baselines; not geo-level media response |
| Multi-geo, regional media | + Random Slopes | Geo-level variation in media creates identification |
| Many precision controls | + Variable Selection | Horseshoe/spike-slab on controls only; never on media |
Variable Selection Caution
Variable selection priors (horseshoe, spike-and-slab) should only be applied to precision control variables—not to confounders, mediators, or media channels. Applying selection priors to causal variables can severely bias effect estimates. See the Variable Selection Guide for details.
6. Sensitivity Analysis
How much do our conclusions depend on our modeling choices? Sensitivity analysis systematically varies assumptions to assess robustness.
Prior Sensitivity
Refit the model with different priors and compare posteriors. If conclusions change dramatically with reasonable prior variations, the data are not strongly informative about those parameters.
Interactive: Prior Sensitivity
See how the posterior changes as the prior becomes more or less informative.
Data Influence
The Pareto k diagnostics from LOO-CV identify influential observations. High-influence observations warrant investigation:
- Are they outliers? Consider robust likelihoods (Student-t)
- Are they unusual periods? (Promotions, COVID, competitive activity)
- Are they data errors? Check the raw data
7. Model Expansion
Based on posterior predictive checks and domain understanding, we iteratively expand the model to address deficiencies. The MMM Framework provides structured expansion paths:
Add Mediators
When upper-funnel metrics (awareness, consideration) are available, add them as mediators to decompose effects into direct and indirect pathways.
- Fully observed: complete time series
- Partially observed: sparse survey data
- Fully latent: requires strong priors
Add Cross-Effects
When modeling product portfolios, add cross-effects to capture cannibalization or halo effects between outcomes.
- Cannibalization: negative cross-effect
- Halo: positive cross-effect
- LKJ prior on error correlations
Add Partial Pooling
With multi-geo data, add hierarchical structure to borrow strength across regions while capturing heterogeneity.
- Random intercepts: geo baselines
- Random slopes: requires geo media variation
- Non-centered for divergence control
Improve Time Structure
If residuals are autocorrelated, add more flexible trend/seasonality:
- Piecewise linear trend with changepoints
- Gaussian Process trend for smooth variation
- Fourier seasonality with more harmonics
- Holiday/event effects
Refine Transformations
If saturation/carryover curves don't match patterns:
- Logistic vs. Hill saturation functions
- Geometric vs. Weibull adstock decay
- Channel-specific vs. shared parameters
- Time-varying saturation for mature brands
Expansion Decision Tree
Residual Pattern?} B -->|No| C[Model OK] B -->|Yes| D{What type?} D -->|Autocorrelation| E[Add Trend/AR] D -->|Geo heterogeneity| F[Add Hierarchical] D -->|Cross-outcome| G[Add Multivariate] D -->|Mediation signal| H[Add Nested] E --> I{Improved ELPD?} F --> I G --> I H --> I I -->|Yes, > 1 SE| J[Accept Expansion] I -->|No| K[Reject / Investigate] J --> A K --> L[Check Priors/
Data Quality] style C fill:#e6f7e6,stroke:#4a8a4a style J fill:#e6f0f7,stroke:#4a6d8a style K fill:#f7e6e6,stroke:#8a4a4a
When to Stop
Model expansion has diminishing returns. Stop when:
- Posterior predictive checks show no systematic problems
- Additional complexity doesn't improve ELPD beyond standard error
- The model answers the scientific questions with adequate precision
- You've started "shopping" for specifications that give desired answers
Avoid Specification Shopping
If you try many models and only report the one with "reasonable" coefficients, you're specification shopping. This invalidates uncertainty quantification and leads to overconfident conclusions. Document all models tried and all decisions made.
MMM-Specific Workflow
Applying the Bayesian workflow to Marketing Mix Models involves specific considerations for the MMM Framework's model classes:
Workflow by Model Type
Standard → Nested Workflow
- Fit Standard MMM first — Establish baseline effects and diagnostics
- Check mediator data — Is it fully observed, sparse, or absent?
- Configure mediator type —
FULLY_OBSERVED,PARTIALLY_OBSERVED, orFULLY_LATENT - Map channels to mediators — Not all channels need to affect all mediators
- Compare ELPD — Does the nested model improve predictions?
- Interpret decomposition — Direct vs. indirect effects, proportion mediated
Standard → Multivariate Workflow
- Fit separate Standard MMMs — One per outcome, check individual diagnostics
- Check residual correlations — Are outcome residuals correlated?
- Configure outcomes —
OutcomeConfigfor each with separate/shared transformations - Add cross-effects —
CANNIBALIZATION,HALO, orSPILLOVER - Set LKJ prior — η=1 (uniform), η=2 (shrink toward independence)
- Interpret cross-effects — Effect of outcome A on outcome B
MMM Prior Guidance
For each major parameter class in the framework, here is guidance on setting informative priors:
| Parameter | Framework Default | Rationale |
|---|---|---|
| Media β (log scale) | HalfNormal(σ=0.5) |
Positive effects; ~5% lift per unit at median |
| Adstock α | Beta(α=3, β=3) |
Centered at 0.5; most mass in [0.2, 0.8] |
| Saturation λ (logistic) | Gamma(α=2, β=1/x̄) |
Scaled to mean spend; 50% saturation at typical level |
| Mediator→Outcome γ | HalfNormal(σ=1) |
Positive awareness→sales; scale depends on units |
| Cross-effect (cannib.) | HalfNormal(σ=0.3) |
Constrained negative; typically small relative to main effects |
| Hierarchical σ (geo) | HalfNormal(σ=0.5) |
Prior on geo-level variation; weakly informative |
| LKJ η (correlations) | η=2.0 |
Mild shrinkage toward independence; η=1 is uniform |
| Horseshoe τ (global) | D₀/(D-D₀) · σ/√n |
Calibrated by expected # nonzero (Piironen & Vehtari, 2017) |
# Example: Configure priors in the framework
from mmm_framework import ModelConfigBuilder, AdstockConfigBuilder, SaturationConfigBuilder
config = (
ModelConfigBuilder()
.with_adstock(
AdstockConfigBuilder()
.with_type("geometric")
.with_alpha_prior("Beta", alpha=3, beta=3) # Centered at 0.5
.build()
)
.with_saturation(
SaturationConfigBuilder()
.with_type("logistic")
.with_lam_prior("Gamma", alpha=2, beta=1/mean_spend) # Scale to data
.build()
)
.with_media_prior("HalfNormal", sigma=0.5) # Positive effects
.build()
)
Validation Strategies
MMM validation goes beyond statistical checks. The framework supports multiple validation approaches:
Holdout Validation
Hold out recent periods and assess prediction accuracy using the framework's
predict() method.
- MAPE, RMSE on holdout
- Coverage of prediction intervals
- ⚠️ Tests extrapolation, not causality
Geo Experiments
Compare MMM estimates to randomized geo experiments—the gold standard for causal validation.
- Match lift estimates to geo tests
- Assess coverage of credible intervals
- Calibrate priors if systematic bias
Plausibility Checks
Verify results against domain knowledge and industry benchmarks.
- ROI within industry norms
- Adstock aligns with creative wear-out
- Saturation at plausible spend levels
Framework Validation Code
# Holdout validation
train_data, test_data = panel.split(holdout_periods=8)
model = BayesianMMM(train_data, config)
model.fit()
# Predict on holdout
predictions = model.predict(test_data)
coverage = model.compute_coverage(test_data, ci=0.9)
print(f"90% CI Coverage: {coverage:.1%}") # Should be ~90%
# Compare to geo experiment
geo_test_lift = 0.12 # 12% lift from experiment
mmm_lift = model.compute_channel_lift("tv", scenario="actual_vs_zero")
print(f"MMM Lift: {mmm_lift.mean():.1%} [{mmm_lift.hdi(0.9)}]")
print(f"Geo Test Lift: {geo_test_lift:.1%}")
# Check if geo test result within MMM credible interval
Documentation is Critical
Maintain a log of all modeling decisions: priors chosen, diagnostics observed, models compared,
and rationale for expansions. The framework's model.save() method preserves configs,
but document the why behind each choice separately. This prevents specification shopping
and enables honest reporting.
References
- Gelman, A., Vehtari, A., Simpson, D., et al. (2020). Bayesian Workflow. arXiv:2011.01808.
- Gabry, J., Simpson, D., Vehtari, A., Betancourt, M., & Gelman, A. (2019). Visualization in Bayesian workflow. J. R. Stat. Soc. A, 182(2), 389-402.
- Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413-1432.
- Betancourt, M. (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.