Bayesian Estimation and Priors for MMM

Summary

The nonlinear MMM is estimated by MCMC (a customized C++ Gibbs/slice sampler and a STAN/HMC implementation), with a support-respecting prior on each parameter: beta or uniform for the retention rate $α$ and the half-saturation point $K$, uniform for the delay $θ$, gamma for the slope $S$, and half-normal for the nonnegative coefficients $β$. The central empirical finding: when the sample is small and the signal weak, the posterior is dominated by the prior and the data cannot correct prior-induced bias, so prior choice has a large, sometimes determinative, impact on estimates and downstream attribution.

Overview

Because the adstock and Hill transforms make the model nonlinear in its parameters, maximizing the likelihood (frequentist MLE) is nontrivial. More importantly, a single MMM dataset carries little information relative to the parameter count, so the Bayesian framework is adopted specifically to incorporate prior knowledge from industry experience or related media-mix models. The model can also be extended to a hierarchical Bayesian form that pools across related brands (Wang et al. 2017) or geos (Sun et al. 2017) to obtain more informative priors.

Main Content

Likelihood, posterior, and MCMC

Let $Φ$ be all model parameters, $X$ the media, $Z$ the controls, $y$ the response. The frequentist MLE is $\hat{Φ} = \arg\max_{Φ} L(y \mid X, Z, Φ)$ (Eq. 8). The Bayesian posterior is
$$p(Φ \mid y, X) \propto L(y \mid X, Z, Φ)\, π(Φ). \tag{9}$$
With conjugate priors the posterior could be derived analytically; here it is sampled by MCMC instead. Two samplers are used: a customized Gibbs sampler built on slice sampling (Neal 2003) in C++ with BOOM (Scott 2016), and a STAN implementation using Hamiltonian Monte Carlo (HMC). High posterior correlation among the transformation parameters makes STAN slow (hours or days on a few thousand data points), so the custom Gibbs sampler is much more efficient. Posterior summaries are the mean, median, or mode, plus quantile-based credible intervals. See MCMC Basics.
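To make the sampling step concrete, here is a minimal sketch of a univariate slice sampler with stepping-out and shrinkage (Neal 2003), applied to a toy one-parameter posterior. It is not the BOOM or STAN implementation; the function names, the made-up data `y`, and the toy model (unit-variance normal likelihood with a half-normal(0, 1) prior on a single coefficient) are assumptions chosen only to illustrate Eq. 9 and the quantile-based summaries.

```python
import numpy as np

def slice_sample(log_density, x0, n_samples, width=1.0, rng=None):
    """Univariate slice sampler with stepping-out and shrinkage (Neal 2003)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_samples)
    x = float(x0)
    for i in range(n_samples):
        # Auxiliary height: log(u * f(x)) with u ~ Uniform(0, 1),
        # using the fact that -log(u) is Exponential(1).
        log_y = log_density(x) - rng.exponential()
        # Stepping out: randomly position an interval of size `width`
        # around x, then expand each end until it leaves the slice.
        left = x - width * rng.uniform()
        right = left + width
        while log_density(left) > log_y:
            left -= width
        while log_density(right) > log_y:
            right += width
        # Shrinkage: propose uniformly, shrink the interval on rejection.
        while True:
            x_new = rng.uniform(left, right)
            if log_density(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples[i] = x
    return samples

# Hypothetical data for a toy model y_i ~ normal(beta, 1).
y = np.array([0.8, 1.2, 1.0, 0.9, 1.1])

def log_posterior(beta):
    """Unnormalized log posterior: log-likelihood + log-prior (Eq. 9)."""
    if beta < 0:  # half-normal prior has support beta >= 0
        return -np.inf
    log_lik = -0.5 * np.sum((y - beta) ** 2)
    log_prior = -0.5 * beta ** 2
    return log_lik + log_prior

rng = np.random.default_rng(0)
draws = slice_sample(log_posterior, x0=1.0, n_samples=5000, rng=rng)

# Posterior summaries: point estimates plus a 95% quantile-based interval.
post_mean = draws.mean()
post_median = np.median(draws)
ci_low, ci_high = np.quantile(draws, [0.025, 0.975])
```

In a Gibbs scheme for the full model, a step like this would be applied to one parameter at a time, conditioning on the current values of the others; the sketch leaves that outer loop out.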

Prior specifications (with rationale)

Priors must respect each parameter’s support (Gelman 2006):

Retention rate $α \in [0, 1)$: beta or uniform on $[0, 1)$ (simulation: $\text{beta}(3, 3)$); narrower support if strong prior knowledge is available.

Delay $θ \in [0, L - 1]$: uniform or scaled beta (simulation: $\text{uniform}(0, 12)$).

Slope $S > 0$: gamma with a positive mode (simulation: $\text{gamma}(3, 1)$).

Half-saturation $K$: beta constrained over the observed spend range (simulation: $\text{beta}(2, 2)$), because $K$ outside the observed range is unidentifiable (see Shape (Saturation) Effects).

Coefficients $β_{m} \geq 0$: half-normal (normal constrained to be nonnegative), since the media effect is believed to be nonnegative (simulation: $\text{half-normal}(0, 1)$).

Baseline $τ \sim \text{normal}(0, 5)$; controls $γ_{c} \sim \text{normal}(0, 1)$; noise variance $\sim \text{inverse-gamma}(0.05, 0.0005)$.
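As a quick check that the simulation priors above respect each parameter's support, the following sketch draws from them with NumPy. Several choices are assumptions rather than facts from the source: `L = 13` is inferred from the $\text{uniform}(0, 12)$ delay prior, $\text{gamma}(3, 1)$ is read as shape 3 with unit scale, and $K$ is assumed to live on a spend axis rescaled to $[0, 1]$. The baseline, control, and noise-variance priors are left out because their parametrization (standard deviation versus variance, scale versus rate) is not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
L = 13  # assumed maximum lag, so that L - 1 = 12

prior_draws = {
    # Retention rate: beta(3, 3), centred at 0.5 on the unit interval.
    "alpha": rng.beta(3, 3, size=n),
    # Delay: uniform over the allowed lags [0, L - 1].
    "theta": rng.uniform(0, L - 1, size=n),
    # Slope: gamma with shape 3 and scale 1 (mode at (3 - 1) * 1 = 2).
    "S": rng.gamma(shape=3, scale=1.0, size=n),
    # Half-saturation: beta(2, 2) on spend rescaled to [0, 1] (assumption).
    "K": rng.beta(2, 2, size=n),
    # Coefficient: half-normal(0, 1), the absolute value of a standard normal.
    "beta": np.abs(rng.normal(0, 1, size=n)),
}

# Empirical range and mean of each prior, as a sanity check on the support.
summary = {
    name: (draws.min(), draws.mean(), draws.max())
    for name, draws in prior_draws.items()
}
```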

Prior dominance in small samples

If the data has strong information content, priors with the same support yield similar posteriors. If not, the prior has a large influence and the posterior may look almost the same as the prior. Empirically (Sec. 6): the adstock parameters $(α, θ)$ are recovered fairly well even in small samples, but the shape parameters $(K, S, β)$ suffer high variance and large bias, and the $β$ Hill curves (each Hill curve scaled by its coefficient $β$) are systematically underestimated. The bias is attributable to the priors: when the sample size is small and the signal weak, the data is not strong enough to correct prior-induced bias.
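The mechanism can be seen in a conjugate toy case that is far simpler than the MMM (an illustration, not a result from the source). Assume $y_i \sim \text{normal}(μ, σ^2)$ with known $σ$ for $i = 1, \dots, n$, and a prior $μ \sim \text{normal}(m_0, s_0^2)$; the symbols $μ$, $m_0$, and $s_0$ are introduced only for this example. The posterior mean is a precision-weighted average of the prior mean and the sample mean $\bar{y}$:

$$E[μ \mid y] = w\, m_0 + (1 - w)\, \bar{y}, \qquad w = \frac{1/s_0^2}{1/s_0^2 + n/σ^2}.$$

With $s_0 = 1$ and $σ = 3$, $n = 9$ gives $w = 1/(1 + 1) = 0.5$, so the prior carries as much weight as the data, while $n = 900$ gives $w = 1/(1 + 100) \approx 0.01$. Small $n$ or large noise pushes $w$ toward 1, which is the regime in which the posterior mostly reproduces the prior.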

Examples

Sensitivity to the prior on $β$ (Sec. 7.1)

Three priors are compared over 500 datasets: $\text{half-normal}(0, 1)$, $\text{normal}(0, 1)$, and $\text{uniform}(0, 3)$. The half-normal and normal priors give nearly identical, underestimated $β$ Hill curves; $\text{uniform}(0, 3)$ puts more mass on large $β$ and so produces smaller bias (e.g. for Media 2 at $x = 1$, the bias is −18.0%, −18.0%, and −1.5% respectively). But this does not generalize: in scenarios where the curves are overestimated, $\text{uniform}(0, 3)$ would worsen the bias. There is no universally “correct” prior.
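The direction of this effect can be reproduced in a toy setting with a one-dimensional grid posterior. The sketch below is not the source's simulation: it assumes the data inform a single coefficient only through a normal likelihood centred at a hypothetical estimate `b_hat` with standard error `se`, and compares the posterior mean under the three priors.

```python
import numpy as np

# Hypothetical weak-signal summary of the data: estimate and standard error.
b_hat, se = 1.0, 0.4

# Dense grid wide enough to cover every prior and the likelihood.
grid = np.linspace(-5.0, 8.0, 130_001)
log_lik = -0.5 * ((grid - b_hat) / se) ** 2

def posterior_mean(prior_density):
    """Posterior mean on the grid; normalizing constants cancel in the ratio."""
    weights = np.exp(log_lik) * prior_density
    return np.sum(grid * weights) / np.sum(weights)

# Unnormalized prior densities evaluated on the grid.
normal_prior = np.exp(-0.5 * grid ** 2)
half_normal_prior = np.where(grid >= 0, normal_prior, 0.0)
uniform_prior = np.where((grid >= 0) & (grid <= 3), 1.0, 0.0)

means = {
    "half-normal(0, 1)": posterior_mean(half_normal_prior),
    "normal(0, 1)": posterior_mean(normal_prior),
    "uniform(0, 3)": posterior_mean(uniform_prior),
}
```

In this toy case the two normal-shaped priors pull the posterior mean below `b_hat` by almost the same amount (to roughly 0.87 and 0.86), while the flat prior leaves it near 1.0. That matches the qualitative pattern reported above, and it also shows why the flat prior would not help if the likelihood itself were biased upward.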

Sensitivity to the prior on $K$ (Sec. 7.2)

Priors $\text{beta}(2, 2)$, $\text{uniform}(0, 1)$, and $\text{uniform}(0, 10)$ are compared across two scenarios (true $K = 0.2$, inside the observed $[0, 1]$ range; true $K = 2$, outside it). The $β$ Hill curves are similar across all three priors, but the estimates of $K$ differ markedly for the wide $\text{uniform}(0, 10)$ prior. Because the media effect depends on the whole curve rather than on $K$ alone, the model is not very sensitive to the $K$ prior, though a tighter, knowledge-backed prior speeds sampler convergence. When $K = 2$ lies outside the data range it cannot be estimated well even with a well-placed prior, yet the curve within the observed range is still estimated well.
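A small numerical sketch shows why $K$ can be poorly determined while the curve is not. It assumes the usual Hill form $β\, x^S / (x^S + K^S)$, which this note does not restate, and uses hypothetical parameter values: two quite different $(K, β)$ pairs, both with $K$ outside the observed $[0, 1]$ spend range, are compared over that range.

```python
import numpy as np

def beta_hill(x, K, S, beta):
    """beta * x^S / (x^S + K^S); equals beta / (1 + (x / K)^(-S)) for x > 0."""
    xs = np.power(x, S)
    return beta * xs / (xs + K ** S)

# Observed spend range, assumed rescaled to [0, 1].
x = np.linspace(0.0, 1.0, 101)

# Two hypothetical parameter sets with K outside the observed range.
# beta for the second curve is chosen so both curves agree at x = 1.
curve_a = beta_hill(x, K=2.0, S=1.0, beta=1.0)
curve_b = beta_hill(x, K=4.0, S=1.0, beta=5.0 / 3.0)

# Largest vertical gap between the two curves over the observed range.
max_gap = np.max(np.abs(curve_a - curve_b))
```

Doubling $K$ from 2 to 4 changes the curve by less than 0.02 anywhere on $[0, 1]$ once $β$ is allowed to compensate, so data confined to that range cannot separate the two parameter sets even though the implied media effect is nearly the same.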

Connections

Estimates the combined model from Bayesian Media Mix Modeling - Overview; priors target the parameters of Carryover (Adstock) Functional Forms and Shape (Saturation) Effects.
General MCMC background (Gibbs, HMC, slice sampling): MCMC Basics; the linear-control part connects to Bayesian Linear Regression.
Posterior samples feed the attribution metrics: ROAS, mROAS, and Optimal Media Mix.
Model selection across functional forms (BIC): MMM Model Selection and Application.
