Bayesian Workflow

A principled approach to building, checking, and refining Bayesian models. This guide follows the framework established by Gelman et al. (2020) and Gabry et al. (2019), adapted for Marketing Mix Models with practical PyMC implementations.

Key Reference

This workflow is based on Bayesian Workflow (Gelman et al., 2020) and Visualization in Bayesian Workflow (Gabry et al., 2019, JRSS-A). The core insight: Bayesian analysis is iterative—building, fitting, checking, and expanding models in a principled cycle.

Introduction

Bayesian data analysis is about more than computing a posterior distribution. It is an iterative process of model building, inference, model checking, and model expansion. Each step informs the next, and visualization is indispensable at every stage.

For Marketing Mix Models, this workflow is especially important because:

The Full Workflow

flowchart TB subgraph Build["1. Model Building"] A1[Define Generative Model] A2[Elicit Prior Distributions] A3[Encode Domain Knowledge] end subgraph Prior["2. Prior Predictive"] B1[Sample from Priors] B2[Generate Fake Data] B3[Check Plausibility] end subgraph Fit["3. Computational Fitting"] C1[Run MCMC Sampler] C2[Check Diagnostics] C3[Address Pathologies] end subgraph Post["4. Posterior Predictive"] D1[Generate Replicated Data] D2[Compare to Observed] D3[Check Calibration] end subgraph Compare["5. Model Comparison"] E1[Compute LOO-CV] E2[Compare ELPD] E3[Stack or Select] end subgraph Sense["6. Sensitivity Analysis"] F1[Vary Priors] F2[Check Data Influence] F3[Assess Robustness] end Build --> Prior Prior -->|Priors implausible| Build Prior -->|Priors ok| Fit Fit -->|Diagnostics fail| Build Fit -->|Diagnostics ok| Post Post -->|Model fails| Iterate Post -->|Model ok| Compare Compare --> Sense Sense --> Iterate subgraph Iterate["7. Model Expansion"] G1[Add Complexity] G2[Refine Structure] end Iterate -->|New model| Build style Build fill:#f0f7e6,stroke:#6d8a4a style Prior fill:#e6f0f7,stroke:#4a6d8a style Fit fill:#f7f0e6,stroke:#8a6d4a style Post fill:#e6f7f0,stroke:#4a8a6d style Compare fill:#f0e6f7,stroke:#6d4a8a style Sense fill:#f7e6f0,stroke:#8a4a6d style Iterate fill:#f7f7e6,stroke:#8a8a4a

The workflow is not strictly linear. At each stage, we may discover problems that send us back to an earlier stage. This is normal and expected—the goal is to build understanding, not just to produce a single "correct" answer.

Philosophy

The Generative Model Perspective

A Bayesian model is a generative model—a complete description of how data could have been generated. This includes the prior \(p(\theta)\), the likelihood \(p(y|\theta)\), and together they define the joint distribution \(p(y, \theta) = p(y|\theta)p(\theta)\). If you can't simulate data from your model, you don't fully understand it.

The workflow emphasizes several key principles:

Principle

Prediction First

Check models by their predictions—both before seeing data (prior predictive) and after fitting (posterior predictive). If predictions don't make sense, the model doesn't make sense.

Principle

Iterate Honestly

Model building is iterative, but changes should be driven by scientific reasoning, not by "trying things until the coefficients look right." Document every change.

Principle

Embrace Uncertainty

Report full posterior distributions, not just point estimates. Uncertainty is information— wide posteriors tell you the data can't distinguish between hypotheses.

1. Model Building

Model building starts with a generative story: how do we believe the data were generated? For an MMM, this story includes the baseline sales process, the effect of marketing activities, and the noise structure.

Prior Elicitation

Priors encode domain knowledge before seeing the data. In MMM, we often have strong beliefs:

Parameter Domain Knowledge Prior Implication
Media effects \(\beta\) Advertising generally has positive, diminishing returns Positive support, shrinkage toward zero
Adstock \(\alpha\) Effects decay over days/weeks, rarely months Beta prior concentrated in [0.3, 0.8]
Saturation \(\lambda\) Diminishing returns set in at realistic spend levels Prior centered on typical campaign spend
Seasonality amplitude Sales vary by season but not 10x Constrained to plausible range

Encoding Domain Knowledge

The prior is the formal mechanism for encoding domain knowledge. Consider the saturation parameter \(\lambda\) in a logistic saturation function:

$$f(x; \lambda) = \frac{1}{1 + e^{-\lambda x}} - 0.5$$ (saturation)

If we know that saturation typically occurs around 1M USD spend, we can set a prior on \(\lambda\) such that \(f(1M)\) has most of its mass in a plausible range (e.g., 60-90% of maximum effect).

# Prior on saturation parameter
# If typical_spend = 1M and we want 70% saturation at typical spend:
# solve: 0.7 = 1/(1 + exp(-lambda * 1M))
# lambda ≈ 0.85 / 1M

import pymc as pm

with pm.Model():
    # Lognormal prior on lambda, centered on expected value
    lam = pm.Lognormal(
        "lambda", 
        mu=np.log(0.85 / typical_spend), 
        sigma=0.5  # allows factor of ~3x variation
    )

2. Prior Predictive Checks

Before fitting the model to data, we sample from the prior predictive distribution:

$$p(y^{\text{rep}}) = \int p(y^{\text{rep}}|\theta) \, p(\theta) \, d\theta$$ (prior predictive)

This generates fake datasets from the model using only the priors—no observed data involved. The question is: could the observed data plausibly have come from this generative process?

Prior Pushforward

Prior Pushforward

The prior pushforward is the distribution of derived quantities (predictions, effect sizes, ROI) implied by the prior. Even if individual parameter priors seem reasonable, their combination might imply absurd predictions. Checking the pushforward catches these problems.

Interactive: Prior Predictive Check

Adjust the prior parameters and see how they affect the prior predictive distribution of sales.

0.5
0.3

Weakly Informative Priors

A prior is weakly informative if the prior predictive distribution:

Common Mistake: Flat Priors

"Non-informative" or flat priors on parameters often imply highly informative priors on predictions. For example, flat priors on many regression coefficients imply that extreme predictions are just as likely as moderate ones. Always check the prior pushforward.

# Prior predictive check in PyMC
with model:
    prior_pred = pm.sample_prior_predictive(samples=500)

# Visualize: do these look like plausible sales data?
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flat):
    ax.plot(prior_pred.prior_predictive["y_obs"][0, i])
    ax.set_title(f"Draw {i+1}")
plt.suptitle("Prior Predictive Samples")
plt.tight_layout()

3. Computational Fitting

Once satisfied with the prior predictive, we fit the model using MCMC (typically NUTS in PyMC). The goal is to obtain samples from the posterior:

$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} \propto p(y|\theta)p(\theta)$$ (Bayes' theorem)

MCMC Diagnostics

MCMC produces samples from the posterior, not the posterior itself. We must verify that the samples are reliable:

Diagnostic Target What it Checks
\(\hat{R}\) (R-hat) < 1.01 Chains have converged to same distribution
ESS (bulk) > 400 Enough independent samples for central tendency
ESS (tail) > 400 Enough independent samples for tail quantiles
Divergences 0 Sampler exploring geometry correctly
Tree depth < max Sampler not struggling with curvature

Interactive: Trace Plot Diagnostics

Good chains look like "hairy caterpillars"—stationary, well-mixed, and indistinguishable.

Divergences and Pathologies

Divergences Indicate Geometry Problems

Divergent transitions mean the sampler encountered regions of high curvature it couldn't explore reliably. Common causes: non-centered parameterizations needed, funnel geometries, multimodality, or model misspecification. Never ignore divergences.

Solutions to common MCMC problems:

Problem

Divergences

  • Increase target_accept (0.9 → 0.95 → 0.99)
  • Use non-centered parameterization for hierarchical models
  • Reparameterize to reduce posterior curvature
  • Check for multimodality or unidentified parameters
Problem

Low ESS / High R-hat

  • Run more iterations
  • Check for slow-mixing parameters (treedepth warnings)
  • Improve initialization
  • Consider tighter priors if identification is weak
# Standard fitting with diagnostics
with model:
    trace = pm.sample(
        draws=2000,
        tune=2000,
        chains=4,
        target_accept=0.95,
        random_seed=42,
        return_inferencedata=True
    )

# Check diagnostics
print(az.summary(trace, var_names=["beta", "alpha", "lambda"]))
# Look for: R-hat < 1.01, ESS > 400, no divergences

4. Posterior Predictive Checks

After fitting, we generate data from the posterior predictive distribution:

$$p(y^{\text{rep}}|y) = \int p(y^{\text{rep}}|\theta) \, p(\theta|y) \, d\theta$$ (posterior predictive)

If the model is good, replicated data \(y^{\text{rep}}\) should look like the observed data \(y\). Systematic differences indicate model misspecification.

Interactive: Posterior Predictive Check

Compare observed data (dark line) to posterior predictive draws (light lines).

Calibration

A model is calibrated if its stated uncertainty is accurate. For example, 90% prediction intervals should contain the true value 90% of the time.

Probability Integral Transform (PIT)

If the posterior predictive is calibrated, the PIT values \(F_{y^{\text{rep}}}(y_{\text{obs}})\)—the CDF of the predictive distribution evaluated at the observed value—should be uniformly distributed. Deviations from uniformity indicate calibration problems.

Leave-One-Out Cross-Validation

LOO-CV estimates out-of-sample predictive performance without actually refitting the model for each held-out observation. PSIS-LOO uses importance sampling to approximate LOO efficiently:

# Compute LOO
loo = az.loo(trace, pointwise=True)
print(loo)

# Check Pareto k diagnostics
# k > 0.7: that observation is highly influential, LOO approximation unreliable
high_k = np.where(loo.pareto_k > 0.7)[0]
if len(high_k) > 0:
    print(f"Warning: {len(high_k)} observations with k > 0.7")

5. Model Comparison

When we have multiple candidate models, we compare them using their expected predictive performance. The key quantity is the expected log pointwise predictive density (ELPD):

$$\text{ELPD} = \sum_{i=1}^n \log p(y_i | y_{-i})$$ (leave-one-out ELPD)

ELPD and LOO-CV

Higher ELPD is better—it means the model makes better predictions on held-out data. The standard error of the ELPD difference tells us whether the difference is meaningful.

# Compare multiple models
comparison = az.compare({
    "standard": trace_standard,
    "nested": trace_nested,
    "hierarchical": trace_hierarchical
})
print(comparison)

# Visualize comparison
az.plot_compare(comparison)

Model Weights

Rather than selecting a single "best" model, consider stacking weights that combine models based on their predictive performance. This accounts for model uncertainty and often gives better predictions than any single model.

Model Topology

Gelman et al. (2020) recommend thinking of models as a topology—a network of related models that can be expanded or simplified. Each model comparison teaches us something about the structure of the problem. Below is the full model topology for the MMM Framework:

flowchart TB subgraph Core["Core Model Types"] STD[Standard MMM
Single outcome] NEST[Nested MMM
+ Mediators] MV[Multivariate MMM
+ Cross-effects] COMB[Combined MMM
Mediators + Cross-effects] end subgraph Hier["Hierarchical Extensions"] H_INT[+ Random Intercepts
Geo-level baseline] H_SLOPE[+ Random Slopes
Geo-level response] H_FULL[Full Hierarchical
Partial pooling] end subgraph VarSel["Variable Selection"] VS_HS[Regularized Horseshoe
Sparse precision controls] VS_SS[Spike-and-Slab
Discrete selection] VS_BL[Bayesian LASSO
L1 regularization] end STD --> NEST STD --> MV NEST --> COMB MV --> COMB STD --> H_INT NEST --> H_INT MV --> H_INT COMB --> H_INT H_INT --> H_SLOPE H_SLOPE --> H_FULL H_FULL -.-> VS_HS H_FULL -.-> VS_SS H_FULL -.-> VS_BL style STD fill:#f0f7e6,stroke:#6d8a4a style NEST fill:#e6f0f7,stroke:#4a6d8a style MV fill:#f7f0e6,stroke:#8a6d4a style COMB fill:#f0e6f7,stroke:#6d4a8a style H_FULL fill:#e6f7f0,stroke:#4a8a6d style VS_HS fill:#f7e6e6,stroke:#8a4a4a

Model Selection Guide

ScenarioModel ChoiceKey Considerations
Single outcome, national data Standard MMM Start here; acknowledge wide uncertainty
Upper-funnel metrics available Nested MMM Awareness/consideration as mediators; direct/indirect decomposition
Product portfolio effects Multivariate MMM Cannibalization, halo effects; correlated errors via LKJ
Full funnel + portfolio Combined MMM Highest complexity; requires strong identification
Multi-geo, national media + Random Intercepts Geo baselines; not geo-level media response
Multi-geo, regional media + Random Slopes Geo-level variation in media creates identification
Many precision controls + Variable Selection Horseshoe/spike-slab on controls only; never on media

Variable Selection Caution

Variable selection priors (horseshoe, spike-and-slab) should only be applied to precision control variables—not to confounders, mediators, or media channels. Applying selection priors to causal variables can severely bias effect estimates. See the Variable Selection Guide for details.

6. Sensitivity Analysis

How much do our conclusions depend on our modeling choices? Sensitivity analysis systematically varies assumptions to assess robustness.

Prior Sensitivity

Refit the model with different priors and compare posteriors. If conclusions change dramatically with reasonable prior variations, the data are not strongly informative about those parameters.

Interactive: Prior Sensitivity

See how the posterior changes as the prior becomes more or less informative.

0.5

Data Influence

The Pareto k diagnostics from LOO-CV identify influential observations. High-influence observations warrant investigation:

7. Model Expansion

Based on posterior predictive checks and domain understanding, we iteratively expand the model to address deficiencies. The MMM Framework provides structured expansion paths:

Standard → Nested

Add Mediators

When upper-funnel metrics (awareness, consideration) are available, add them as mediators to decompose effects into direct and indirect pathways.

  • Fully observed: complete time series
  • Partially observed: sparse survey data
  • Fully latent: requires strong priors
Standard → Multivariate

Add Cross-Effects

When modeling product portfolios, add cross-effects to capture cannibalization or halo effects between outcomes.

  • Cannibalization: negative cross-effect
  • Halo: positive cross-effect
  • LKJ prior on error correlations
Any → Hierarchical

Add Partial Pooling

With multi-geo data, add hierarchical structure to borrow strength across regions while capturing heterogeneity.

  • Random intercepts: geo baselines
  • Random slopes: requires geo media variation
  • Non-centered for divergence control
Expansion

Improve Time Structure

If residuals are autocorrelated, add more flexible trend/seasonality:

  • Piecewise linear trend with changepoints
  • Gaussian Process trend for smooth variation
  • Fourier seasonality with more harmonics
  • Holiday/event effects
Expansion

Refine Transformations

If saturation/carryover curves don't match patterns:

  • Logistic vs. Hill saturation functions
  • Geometric vs. Weibull adstock decay
  • Channel-specific vs. shared parameters
  • Time-varying saturation for mature brands

Expansion Decision Tree

flowchart TD A[Posterior Predictive Check] --> B{Systematic
Residual Pattern?} B -->|No| C[Model OK] B -->|Yes| D{What type?} D -->|Autocorrelation| E[Add Trend/AR] D -->|Geo heterogeneity| F[Add Hierarchical] D -->|Cross-outcome| G[Add Multivariate] D -->|Mediation signal| H[Add Nested] E --> I{Improved ELPD?} F --> I G --> I H --> I I -->|Yes, > 1 SE| J[Accept Expansion] I -->|No| K[Reject / Investigate] J --> A K --> L[Check Priors/
Data Quality] style C fill:#e6f7e6,stroke:#4a8a4a style J fill:#e6f0f7,stroke:#4a6d8a style K fill:#f7e6e6,stroke:#8a4a4a

When to Stop

Model expansion has diminishing returns. Stop when:

Avoid Specification Shopping

If you try many models and only report the one with "reasonable" coefficients, you're specification shopping. This invalidates uncertainty quantification and leads to overconfident conclusions. Document all models tried and all decisions made.

MMM-Specific Workflow

Applying the Bayesian workflow to Marketing Mix Models involves specific considerations for the MMM Framework's model classes:

Workflow by Model Type

Standard → Nested Workflow

  1. Fit Standard MMM first — Establish baseline effects and diagnostics
  2. Check mediator data — Is it fully observed, sparse, or absent?
  3. Configure mediator typeFULLY_OBSERVED, PARTIALLY_OBSERVED, or FULLY_LATENT
  4. Map channels to mediators — Not all channels need to affect all mediators
  5. Compare ELPD — Does the nested model improve predictions?
  6. Interpret decomposition — Direct vs. indirect effects, proportion mediated

Standard → Multivariate Workflow

  1. Fit separate Standard MMMs — One per outcome, check individual diagnostics
  2. Check residual correlations — Are outcome residuals correlated?
  3. Configure outcomesOutcomeConfig for each with separate/shared transformations
  4. Add cross-effectsCANNIBALIZATION, HALO, or SPILLOVER
  5. Set LKJ prior — η=1 (uniform), η=2 (shrink toward independence)
  6. Interpret cross-effects — Effect of outcome A on outcome B

MMM Prior Guidance

For each major parameter class in the framework, here is guidance on setting informative priors:

Parameter Framework Default Rationale
Media β (log scale) HalfNormal(σ=0.5) Positive effects; ~5% lift per unit at median
Adstock α Beta(α=3, β=3) Centered at 0.5; most mass in [0.2, 0.8]
Saturation λ (logistic) Gamma(α=2, β=1/x̄) Scaled to mean spend; 50% saturation at typical level
Mediator→Outcome γ HalfNormal(σ=1) Positive awareness→sales; scale depends on units
Cross-effect (cannib.) HalfNormal(σ=0.3) Constrained negative; typically small relative to main effects
Hierarchical σ (geo) HalfNormal(σ=0.5) Prior on geo-level variation; weakly informative
LKJ η (correlations) η=2.0 Mild shrinkage toward independence; η=1 is uniform
Horseshoe τ (global) D₀/(D-D₀) · σ/√n Calibrated by expected # nonzero (Piironen & Vehtari, 2017)
# Example: Configure priors in the framework
from mmm_framework import ModelConfigBuilder, AdstockConfigBuilder, SaturationConfigBuilder

config = (
    ModelConfigBuilder()
    .with_adstock(
        AdstockConfigBuilder()
        .with_type("geometric")
        .with_alpha_prior("Beta", alpha=3, beta=3)  # Centered at 0.5
        .build()
    )
    .with_saturation(
        SaturationConfigBuilder()
        .with_type("logistic")
        .with_lam_prior("Gamma", alpha=2, beta=1/mean_spend)  # Scale to data
        .build()
    )
    .with_media_prior("HalfNormal", sigma=0.5)  # Positive effects
    .build()
)

Validation Strategies

MMM validation goes beyond statistical checks. The framework supports multiple validation approaches:

Internal

Holdout Validation

Hold out recent periods and assess prediction accuracy using the framework's predict() method.

  • MAPE, RMSE on holdout
  • Coverage of prediction intervals
  • ⚠️ Tests extrapolation, not causality
External

Geo Experiments

Compare MMM estimates to randomized geo experiments—the gold standard for causal validation.

  • Match lift estimates to geo tests
  • Assess coverage of credible intervals
  • Calibrate priors if systematic bias
Business

Plausibility Checks

Verify results against domain knowledge and industry benchmarks.

  • ROI within industry norms
  • Adstock aligns with creative wear-out
  • Saturation at plausible spend levels

Framework Validation Code

# Holdout validation
train_data, test_data = panel.split(holdout_periods=8)
model = BayesianMMM(train_data, config)
model.fit()

# Predict on holdout
predictions = model.predict(test_data)
coverage = model.compute_coverage(test_data, ci=0.9)
print(f"90% CI Coverage: {coverage:.1%}")  # Should be ~90%

# Compare to geo experiment
geo_test_lift = 0.12  # 12% lift from experiment
mmm_lift = model.compute_channel_lift("tv", scenario="actual_vs_zero")
print(f"MMM Lift: {mmm_lift.mean():.1%} [{mmm_lift.hdi(0.9)}]")
print(f"Geo Test Lift: {geo_test_lift:.1%}")
# Check if geo test result within MMM credible interval

Documentation is Critical

Maintain a log of all modeling decisions: priors chosen, diagnostics observed, models compared, and rationale for expansions. The framework's model.save() method preserves configs, but document the why behind each choice separately. This prevents specification shopping and enables honest reporting.

References