Causal Inference for Analysts
A practical guide to causal reasoning using Judea Pearl's framework of structural causal models, do-calculus, and counterfactual analysis—with applications to Marketing Mix Modeling.
Key References
This guide draws primarily from Pearl, Glymour, & Jewell (2016) Causal Inference in Statistics: A Primer, Pearl (2009) Causality, and Pearl & Mackenzie (2018) The Book of Why. For MMM-specific applications, see also Cunningham (2021) Causal Inference: The Mixtape.
Why Causal Inference?
Marketing Mix Modeling asks fundamentally causal questions: What is the effect of TV advertising on sales? If we had spent more on digital, how much more would we have sold? These questions cannot be answered by correlation alone.
Correlation ≠ Causation
Ice cream sales and drowning deaths are correlated. But eating ice cream doesn't cause drowning; both are caused by summer heat. Confusing correlation with causation leads to wrong decisions.
Prediction ≠ Intervention
A model can predict sales perfectly from ad spend historically, yet give completely wrong answers about what happens if we change ad spend. Prediction is about patterns; causation is about mechanisms.
The goal of causal inference is to answer interventional and counterfactual questions from observational data—when experiments are impossible or impractical.
The Fundamental Problem of Causal Inference
The Fundamental Problem
For any individual unit at any moment in time, we can observe at most one potential outcome: the outcome under the treatment actually received. The outcome under the alternative treatment is forever unobservable. This missing data problem is the fundamental challenge of causal inference.
Consider a single week where we spent 100K USD on TV. We observed sales of 1M USD. What would sales have been if we had spent 0 USD on TV that week? We can never observe this directly, because we can't rewind time and run the alternative scenario.
The individual causal effect is the difference \(Y^{a=1} - Y^{a=0}\), where \(Y^{a=1}\) is the potential outcome under treatment and \(Y^{a=0}\) is the potential outcome under control. We observe one; the other is the counterfactual.
Structural Causal Models
A Structural Causal Model (SCM) is a mathematical object that encodes causal relationships. It consists of three components:
Definition: Structural Causal Model
An SCM \(\mathcal{M} = \langle U, V, F \rangle\) consists of:
- U: Exogenous variables (external, not caused by anything in the model)
- V: Endogenous variables (determined by variables in the model)
- F: Structural equations \(v_i = f_i(\text{pa}_i, u_i)\) for each \(v_i \in V\)
Example: Simple Marketing SCM
The structural equations are asymmetric: \(\text{AdSpend} = f(\text{Budget})\) means Budget causes AdSpend, not the reverse. This asymmetry encodes the direction of causation.
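A minimal sketch of such an SCM in Python, assuming linear structural equations for the graph Budget → AdSpend → Awareness → Sales with a direct AdSpend → Sales edge. The coefficients, noise scales, and the name `sample_unit` are hypothetical and chosen only for illustration.

```python
import random

def sample_unit(rng):
    """Draw one unit from a hypothetical linear marketing SCM."""
    # U: exogenous noise terms, drawn independently of each other.
    u_budget = rng.gauss(100.0, 10.0)
    u_ad = rng.gauss(0.0, 5.0)
    u_aware = rng.gauss(0.0, 2.0)
    u_sales = rng.gauss(0.0, 10.0)
    # F: each endogenous variable is assigned from its parents and its own noise.
    # The assignment direction is the causal direction.
    budget = u_budget
    ad_spend = 0.5 * budget + u_ad
    awareness = 0.4 * ad_spend + u_aware
    sales = 200.0 + 2.0 * awareness + 1.0 * ad_spend + u_sales
    return {
        "budget": budget,
        "ad_spend": ad_spend,
        "awareness": awareness,
        "sales": sales,
    }

rng = random.Random(0)
units = [sample_unit(rng) for _ in range(20_000)]
mean_sales = sum(u["sales"] for u in units) / len(units)
# Analytic expectation under these assumed coefficients:
# 200 + 2.0 * (0.4 * 50) + 1.0 * 50 = 290
print(round(mean_sales, 1))
```

Because each variable is computed only from its parents, changing the `ad_spend` line changes `awareness` and `sales` downstream but leaves `budget` untouched, which is the asymmetry the structural equations encode.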
Directed Acyclic Graphs (DAGs)
Every SCM implies a Directed Acyclic Graph (DAG) where nodes are variables and edges point from causes to effects. DAGs provide a visual language for causal reasoning.
Budget → Ad Spend → Awareness → Sales, with a direct path Ad Spend → Sales
d-Separation
d-separation is a graphical criterion for reading conditional independence from a DAG. Two variables are d-separated (conditionally independent) given a set Z if every path between them is "blocked."
Path Blocking Rules
A path is blocked by conditioning set Z if:
- The path contains a chain A → B → C or fork A ← B → C, and B ∈ Z
- The path contains a collider A → B ← C, and B ∉ Z (and no descendant of B is in Z)
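The blocking rules can be checked mechanically. The sketch below uses a standard equivalent test (restrict to the ancestors of the variables involved, moralize, remove the conditioning set, then test connectivity) rather than enumerating paths. The graph encoding as a dict of parent lists and the function names are assumptions made for this example.

```python
from itertools import combinations

def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, []))
    return seen

def d_separated(parents, x, y, z):
    """True if x and y are d-separated given the set z (x, y not in z)."""
    z = set(z)
    keep = ancestors(parents, {x, y} | z)
    # Moralize the ancestral graph: connect each node to its parents,
    # connect parents that share a child, and drop edge directions.
    adj = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, []))
        for p in pa:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(pa, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Remove the conditioning set and check whether x can still reach y.
    seen, stack = set(), [x]
    while stack:
        v = stack.pop()
        if v == y:
            return False
        if v in seen or v in z:
            continue
        seen.add(v)
        stack.extend(adj[v] - z)
    return True

chain = {"A": [], "B": ["A"], "C": ["B"]}          # A -> B -> C
fork = {"B": [], "A": ["B"], "C": ["B"]}           # A <- B -> C
collider = {"A": [], "C": [], "B": ["A", "C"], "D": ["B"]}  # A -> B <- C, B -> D

print(d_separated(chain, "A", "C", {"B"}))      # True: conditioning on B blocks the chain
print(d_separated(collider, "A", "C", set()))   # True: an unconditioned collider blocks
print(d_separated(collider, "A", "C", {"D"}))   # False: a descendant of the collider opens it
```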
The Three Causal Building Blocks
Chain (Mediation)
A causes C through B. Conditioning on B blocks the path from A to C.
Example: Ad → Awareness → Sales
Fork (Confounding)
B causes both A and C, creating spurious correlation. Conditioning on B blocks this.
Example: Season → Ice Cream, Season → Drowning
Collider (Selection)
A and C both cause B. They're independent, but conditioning on B creates spurious association.
Example: Talent → Hollywood, Beauty → Hollywood
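The three patterns can be seen in simulated data. The following sketch assumes linear relationships with unit-variance Gaussian noise (an illustrative choice, not part of the definitions) and compares the raw correlation between A and C with the partial correlation after adjusting for B.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def partial_corr(a, c, b):
    """Correlation between a and c after linearly adjusting both for b."""
    r_ac, r_ab, r_cb = corr(a, c), corr(a, b), corr(c, b)
    return (r_ac - r_ab * r_cb) / math.sqrt((1 - r_ab**2) * (1 - r_cb**2))

rng = random.Random(1)
N = 20_000

def noise():
    return [rng.gauss(0.0, 1.0) for _ in range(N)]

def add(*cols):
    """Element-wise sum of several columns."""
    return [sum(vals) for vals in zip(*cols)]

# Chain: A -> B -> C
a = noise()
b = add(a, noise())
c = add(b, noise())
chain = (corr(a, c), partial_corr(a, c, b))

# Fork: A <- B -> C
b = noise()
a = add(b, noise())
c = add(b, noise())
fork = (corr(a, c), partial_corr(a, c, b))

# Collider: A -> B <- C
a = noise()
c = noise()
b = add(a, c, noise())
collider = (corr(a, c), partial_corr(a, c, b))

for name, (raw, adjusted) in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(f"{name:9s} raw={raw:+.2f} given B={adjusted:+.2f}")
```

Under these assumptions, the chain and fork show a clear raw correlation that vanishes after adjusting for B, while the collider shows no raw correlation and a negative one after adjusting for B.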
Pearl's Ladder of Causation
Pearl's Ladder of Causation (also called the Causal Hierarchy) describes three levels of causal reasoning, each requiring different information and enabling different queries.
Level 3 (Counterfactuals): "What would Y have been if X had been different, given what I observed?"
Query: \(P(Y_x | X', Y')\). Requires full SCM with specific \(U\) values.
Level 2 (Intervention): "What would happen to Y if I set X to a specific value?"
Query: \(P(Y | do(X=x))\). Requires causal graph structure.
Level 1 (Association): "What does observing X tell me about Y?"
Query: \(P(Y | X=x)\). Requires only observational data.
The Hierarchy is Strict
You cannot answer Level 2 questions with Level 1 data alone, nor Level 3 questions with Level 2 information alone—no matter how much data you have. Each level requires additional assumptions (encoded in the causal model) to climb the ladder.
Level 1: Association
Association queries ask about statistical dependencies: How are X and Y related in the data?
Standard machine learning and predictive modeling operate at Level 1. Given historical data, they learn \(P(Y|X)\)—the distribution of Y given we observe X. This is sufficient for prediction but not for intervention.
Level 2: Intervention
Intervention queries ask what happens when we act: If I set X to x, what happens to Y?
The \(do(\cdot)\) operator represents an intervention—physically setting a variable to a value, rather than passively observing it. This is fundamentally different from conditioning.
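To make the difference between seeing and doing concrete, here is a small simulation of a confounded binary system Z → X, Z → Y, X → Y. All probabilities are made up for this sketch, and `simulate` is a hypothetical helper.

```python
import random

def simulate(n, rng, do_x=None):
    """Sample (z, x, y) rows; if do_x is given, X is forced to that value."""
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5
        if do_x is None:
            # Observational regime: X listens to its natural cause Z.
            x = rng.random() < (0.8 if z else 0.2)
        else:
            # Interventional regime: X ignores Z.
            x = bool(do_x)
        y = rng.random() < 0.2 + 0.1 * x + 0.5 * z
        rows.append((int(z), int(x), int(y)))
    return rows

def mean_y(rows, x):
    """Share of rows with Y=1 among rows where X equals x."""
    ys = [r[2] for r in rows if r[1] == x]
    return sum(ys) / len(ys)

rng = random.Random(2)
observed = simulate(50_000, rng)
forced_on = simulate(50_000, rng, do_x=1)
forced_off = simulate(50_000, rng, do_x=0)

seeing = mean_y(observed, 1) - mean_y(observed, 0)      # P(Y|X=1) - P(Y|X=0)
doing = mean_y(forced_on, 1) - mean_y(forced_off, 0)    # P(Y|do(X=1)) - P(Y|do(X=0))
print(f"seeing: {seeing:.2f}, doing: {doing:.2f}")
```

With these assumed probabilities the observational contrast is about 0.4 while the interventional contrast is about 0.1: conditioning on X also carries information about Z, whereas setting X does not.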
Level 3: Counterfactuals
Counterfactual queries ask about alternative histories: Given what actually happened, what would have happened under different circumstances?
The query \(P(Y_{x'} | X=x, Y=y)\) asks: "For a unit that actually received X=x and had outcome Y=y, what would Y have been if X had been x' instead?" This is the language of individual causal effects.
MMM Counterfactuals
"Given that we spent 100K USD on TV and observed 1M USD in sales, how much would we have sold if we had spent 0 USD on TV?" This is a counterfactual question—it's about a specific week that already happened.
The do-Operator
The \(do(\cdot)\) operator formalizes intervention. When we write \(do(X=x)\), we mean:
- Delete all arrows pointing into X in the causal graph
- Set X to the value x
- Compute the resulting distribution over other variables
This "graph surgery" reflects what happens in an experiment: we override the natural causes of X and force it to take a specific value.
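As a sketch, graph surgery on a DAG stored as a dict of parent lists (an encoding assumed here for illustration) amounts to emptying the parent list of the intervened node:

```python
def do(parents, target):
    """Return a copy of the graph with every arrow into `target` deleted."""
    return {
        node: ([] if node == target else list(pa))
        for node, pa in parents.items()
    }

# Confounded graph: Z -> X, Z -> Y, X -> Y
graph = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}
mutilated = do(graph, "X")

print(mutilated)  # X has no parents; the edges into Y are untouched
```

The original graph is left unmodified, so the observational and interventional structures can be compared side by side.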
Figure: the same graph in two regimes. Observational: \(P(Y|X)\) is confounded by Z. Interventional: under \(P(Y|do(X))\), Z no longer confounds.
The Three Rules of do-Calculus
Pearl's do-calculus provides three rules for manipulating expressions containing \(do(\cdot)\). Together, they are complete: any identifiable causal effect can be computed using these rules.
Rule 1: Insertion/Deletion of Observations
Statement: \(P(y | do(x), z, w) = P(y | do(x), w)\)
Condition: \(Y \perp\!\!\!\perp Z | X, W\) in the graph \(G_{\overline{X}}\) (the graph with all arrows into X deleted).
Intuition: If Z doesn't affect Y once we've intervened on X and conditioned on W, we can ignore Z.
Rule 2: Action/Observation Exchange
Statement: \(P(y | do(x), do(z), w) = P(y | do(x), z, w)\)
Condition: \(Y \perp\!\!\!\perp Z | X, W\) in the graph \(G_{\overline{X}, \underline{Z}}\) (arrows into X deleted, arrows out of Z deleted).
Intuition: If intervening on Z has the same effect as observing Z (given the other conditions), we can replace \(do(Z)\) with observation.
Rule 3: Insertion/Deletion of Actions
Statement: \(P(y | do(x), do(z), w) = P(y | do(x), w)\)
Condition: \(Y \perp\!\!\!\perp Z | X, W\) in the graph \(G_{\overline{X}, \overline{Z(W)}}\) where \(Z(W)\) is the set of Z-nodes not ancestors of any W-node in \(G_{\overline{X}}\).
Intuition: If Z has no effect on Y given our other interventions and observations, we can drop \(do(Z)\).
Identification
A causal effect is identifiable if it can be computed from observational data plus the causal graph structure, without knowing the functional forms \(f_i\).
Identifiability
The causal effect \(P(Y|do(X))\) is identifiable if it can be expressed as a function of the observational distribution \(P(V)\) alone. If it cannot, no amount of observational data will reveal the causal effect—an experiment is required.
The Backdoor Criterion
The backdoor criterion provides a simple sufficient condition for identification.
Backdoor Criterion
A set of variables Z satisfies the backdoor criterion relative to (X, Y) if:
- No node in Z is a descendant of X
- Z blocks all backdoor paths from X to Y (paths with an arrow into X)
If Z satisfies the backdoor criterion, the causal effect is identified by the backdoor adjustment formula: \(P(Y | do(X=x)) = \sum_z P(Y | X=x, Z=z)\, P(Z=z)\).
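A minimal sketch of backdoor adjustment on simulated data, computing \(\sum_z P(Y | X=x, Z=z)\, P(Z=z)\) from observational rows only. The binary system Z → X, Z → Y, X → Y and its probabilities are assumptions made for the example; the true effect of X on Y in this setup is 0.1.

```python
import random

rng = random.Random(3)
data = []  # observational rows (z, x, y)
for _ in range(100_000):
    z = int(rng.random() < 0.5)
    x = int(rng.random() < (0.8 if z else 0.2))
    y = int(rng.random() < 0.2 + 0.1 * x + 0.5 * z)
    data.append((z, x, y))

def p_y_given(x, z=None):
    """Empirical P(Y=1 | X=x) or, if z is given, P(Y=1 | X=x, Z=z)."""
    ys = [r[2] for r in data if r[1] == x and (z is None or r[0] == z)]
    return sum(ys) / len(ys)

def p_z(z):
    """Empirical P(Z=z)."""
    return sum(1 for r in data if r[0] == z) / len(data)

def backdoor(x):
    """Estimate P(Y=1 | do(X=x)) by averaging over the distribution of Z."""
    return sum(p_y_given(x, z) * p_z(z) for z in (0, 1))

naive = p_y_given(1) - p_y_given(0)
adjusted = backdoor(1) - backdoor(0)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

The naive contrast mixes the effect of X with the effect of Z; weighting each stratum by \(P(Z=z)\) instead of \(P(Z=z | X=x)\) removes that mixing.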
The Frontdoor Criterion
When the backdoor criterion fails (no valid adjustment set exists), the frontdoor criterion may still identify the effect through a mediator.
Frontdoor Criterion
A set of variables M satisfies the frontdoor criterion relative to (X, Y) if:
- M intercepts all directed paths from X to Y
- There is no unblocked backdoor path from X to M
- All backdoor paths from M to Y are blocked by X
U confounds X→Y, but M satisfies the frontdoor criterion
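When M satisfies the frontdoor criterion, the effect is given by the frontdoor adjustment formula \(P(y | do(x)) = \sum_m P(m | x) \sum_{x'} P(y | m, x')\, P(x')\). The sketch below applies it to simulated binary data in which the confounder U is generated but never recorded. The probabilities are invented for the example, and the true effect in this setup is 0.32.

```python
import random

rng = random.Random(5)
data = []  # observed rows (x, m, y); the confounder u is never stored
for _ in range(100_000):
    u = rng.random() < 0.5
    x = int(rng.random() < (0.8 if u else 0.2))
    m = int(rng.random() < (0.9 if x else 0.1))
    y = int(rng.random() < 0.1 + 0.4 * m + 0.4 * u)
    data.append((x, m, y))

def prob(event, given=lambda row: True):
    """Empirical P(event | given) over the observed rows."""
    rows = [row for row in data if given(row)]
    return sum(1 for row in rows if event(row)) / len(rows)

def frontdoor(x):
    """Estimate P(Y=1 | do(X=x)) using only x, m, and y."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = prob(lambda r: r[1] == m, lambda r: r[0] == x)
        # Effect of M on Y, adjusting for X to block the backdoor path via U.
        inner = sum(
            prob(lambda r: r[2] == 1, lambda r: r[1] == m and r[0] == xp)
            * prob(lambda r: r[0] == xp)
            for xp in (0, 1)
        )
        total += p_m_given_x * inner
    return total

naive = (prob(lambda r: r[2] == 1, lambda r: r[0] == 1)
         - prob(lambda r: r[2] == 1, lambda r: r[0] == 0))
effect = frontdoor(1) - frontdoor(0)
print(f"naive: {naive:.2f}, frontdoor: {effect:.2f}")
```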
Instrumental Variables
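An instrumental variable Z affects X, has no effect on Y except through X, and shares no unmeasured common causes with Y. It provides identification when direct backdoor adjustment is impossible.
For linear models, the IV estimate is \(\hat{\beta}_{IV} = \operatorname{Cov}(Z, Y) / \operatorname{Cov}(Z, X)\).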
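A short simulation of the linear case, using the ratio \(\operatorname{Cov}(Z, Y) / \operatorname{Cov}(Z, X)\). The data-generating process is an assumption of this sketch: the true effect of X on Y is set to 2.0, and an unobserved U pushes the ordinary regression slope upward.

```python
import random

def cov(a, b):
    """Covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

rng = random.Random(4)
z, x, y = [], [], []
for _ in range(50_000):
    u = rng.gauss(0.0, 1.0)    # unobserved confounder of X and Y
    zi = rng.gauss(0.0, 1.0)   # instrument: affects X, independent of U
    xi = zi + u + rng.gauss(0.0, 1.0)
    yi = 2.0 * xi + 3.0 * u + rng.gauss(0.0, 1.0)
    z.append(zi)
    x.append(xi)
    y.append(yi)

beta_ols = cov(x, y) / cov(x, x)  # biased by U
beta_iv = cov(z, y) / cov(z, x)   # uses only the variation in X driven by Z
print(f"OLS: {beta_ols:.2f}, IV: {beta_iv:.2f}")
```

In this setup the regression slope settles near 3.0 while the IV ratio recovers a value near the assumed 2.0.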
Computing Counterfactuals
Counterfactual reasoning requires three steps:
1. Abduction
Given the evidence (what we observed), infer the values of exogenous variables U that are consistent with the observation.
2. Action
Modify the structural equations to reflect the hypothetical intervention (set X to the counterfactual value).
3. Prediction
Use the modified model with the inferred U values to compute the counterfactual outcome.
Example: Counterfactual Sales
Consider a simple linear SCM in which sales Y depend on ad spend X through \(Y = \beta X + U_Y\):
Observation: We observed X=100 (spent 100K USD) and Y=500 (500K USD sales).
Question: What would Y have been if X had been 0?
Step 1 (Abduction): From Y = βX + U_Y and the observation, infer U_Y = 500 - 100β.
Step 2 (Action): Set X = 0 in the modified model.
Step 3 (Prediction): Y_{X=0} = β(0) + U_Y = 500 - 100β.
If β = 3 (each 1K USD in ad spend generates 3K USD in sales), then Y_{X=0} = 500 - 300 = 200K USD.
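The three steps for this linear example can be sketched as a single function. The name `counterfactual_y` is hypothetical, and β is taken as known here, whereas in practice it would be estimated.

```python
def counterfactual_y(x_obs, y_obs, x_cf, beta):
    """Counterfactual outcome under Y = beta * X + U_Y."""
    # Step 1 (abduction): recover the unit's noise term from the observation.
    u_y = y_obs - beta * x_obs
    # Step 2 (action): replace X with the counterfactual value x_cf.
    # Step 3 (prediction): evaluate the same equation with the same noise.
    return beta * x_cf + u_y

y_cf = counterfactual_y(x_obs=100, y_obs=500, x_cf=0, beta=3)
print(y_cf)  # 200
```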
The Role of the Structural Model
Counterfactual computation requires knowing (or estimating) the structural equations, not just the DAG. The functional form matters—linear models give different counterfactuals than nonlinear models.
Common Pitfalls
Confounding
Confounding occurs when a common cause of treatment and outcome creates a spurious association.
Seasonality increases both ad spend (holiday campaigns) and sales (holiday demand), inflating the apparent effect of ads.
Solution: Control for confounders via backdoor adjustment, or use experimental designs.
Collider Bias
Collider bias (selection bias, Berkson's paradox) occurs when you condition on a common effect of two variables, creating a spurious association between them.
Among successful brands (conditioning on C), ad quality and luck appear negatively correlated, but they're actually independent.
Never Condition on Colliders
Conditioning on a collider (or its descendants) opens a spurious path between its causes. This is why controlling for "everything" can make estimates worse, not better.
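A small simulation of this selection effect, assuming (for illustration only) that ad quality and luck are independent standard normal draws and that a brand counts as successful when their sum exceeds 1:

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

rng = random.Random(7)
quality = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
luck = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

# Success is a common effect (collider) of quality and luck.
successful = [(q, l) for q, l in zip(quality, luck) if q + l > 1.0]

corr_all = corr(quality, luck)
corr_selected = corr([q for q, _ in successful], [l for _, l in successful])
print(f"all brands: {corr_all:+.2f}, successful only: {corr_selected:+.2f}")
```

Across all brands the correlation is close to zero; restricting to successful brands produces a clearly negative one, even though nothing in the data-generating process links the two causes.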
Overcontrol Bias
Overcontrol bias occurs when you adjust for a mediator on the causal pathway, blocking part of the effect you're trying to measure.
If you control for Awareness, you block the X→M→Y path, underestimating TV's total effect.
Rule: Only control for confounders, never mediators (unless you specifically want the direct effect).
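The following sketch illustrates overcontrol in a linear setting, assuming TV → Awareness → Sales plus a direct TV → Sales edge with made-up coefficients (direct effect 1.0, total effect 2.0). It compares the slope of Sales on TV alone with the TV coefficient when Awareness is added as a control, using the closed-form two-regressor least-squares solution.

```python
import random

def cov(a, b):
    """Covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

rng = random.Random(6)
tv, aware, sales = [], [], []
for _ in range(20_000):
    x = rng.gauss(0.0, 1.0)
    m = 0.5 * x + rng.gauss(0.0, 1.0)            # mediator
    y = 1.0 * x + 2.0 * m + rng.gauss(0.0, 1.0)  # direct plus mediated effect
    tv.append(x)
    aware.append(m)
    sales.append(y)

# Regression of sales on tv only: estimates the total effect (1.0 + 2.0 * 0.5).
total = cov(tv, sales) / cov(tv, tv)

# TV coefficient in the regression of sales on tv and aware.
den = cov(tv, tv) * cov(aware, aware) - cov(tv, aware) ** 2
controlled = (cov(tv, sales) * cov(aware, aware)
              - cov(aware, sales) * cov(tv, aware)) / den
print(f"total: {total:.2f}, controlling for awareness: {controlled:.2f}")
```

Controlling for the mediator returns only the direct effect, about half of the total effect in this assumed setup.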
MMM as a DAG
A Marketing Mix Model encodes specific causal assumptions. Here's the DAG implied by a standard MMM:
Media Effect Identification
For media effects to be identified in an MMM, we need to satisfy the backdoor criterion by controlling for all common causes of media spend and sales.
| Potential Confounder | Effect on Spend | Effect on Sales | Action |
|---|---|---|---|
| Seasonality | Holiday campaigns | Holiday demand | Include seasonal controls |
| Trend | Growing budgets | Market growth | Include trend component |
| Promotions | Often co-occur with ads | Direct sales lift | Include promotion indicator |
| Competitor activity | Defensive spending | Market share shifts | Include if observable |
National Media, Geo-Level "Effects"
If media spend is national (same value for all geos), you cannot identify geo-level differential effects from observational data. Any apparent geo-level variation reflects correlation with timing patterns, not causal heterogeneity. See the Hierarchical Model section for details.
Counterfactual Contribution Analysis
MMM contribution analysis is fundamentally counterfactual: "How much of observed sales is attributable to each channel?" This requires computing \(\text{Contribution}_c = Y - Y_{do(X_c = 0)}\).
This is the difference between actual sales and the counterfactual sales if channel c had been set to zero. The MMM Framework implements this via:
```python
# Counterfactual contribution in the framework
contributions = model.compute_contributions(
    method="counterfactual",  # vs "marginal" decomposition
    baseline="zero"           # counterfactual: set channel to zero
)
# For each posterior draw, compute Y - Y_{do(X_c=0)}
# Returns full posterior distribution of contributions
```
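Contributions Need Not Sum to the Total
When channels interact, the sum of individual counterfactual contributions generally differs from total sales minus baseline, because removing all channels simultaneously is different from removing each one at a time. With synergistic (for example, multiplicative) effects the sum exceeds the total; when channels share a single saturating response it falls short; in a purely additive model, even one with per-channel saturation, the two agree. A mismatch is therefore expected and does not by itself indicate an error.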
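How the one-at-a-time sum compares with the total depends on how channels combine. A small numeric check, assuming three hypothetical two-channel response functions (none of them taken from the framework), shows the size and sign of the gap:

```python
import math

def sat(x):
    """Hypothetical saturating response curve."""
    return 1.0 - math.exp(-x)

def additive(x1, x2):
    # Each channel saturates on its own; effects add.
    return 100.0 + 50.0 * sat(x1) + 30.0 * sat(x2)

def multiplicative(x1, x2):
    # Channels scale each other (synergy).
    return 100.0 * (1.0 + 0.5 * sat(x1)) * (1.0 + 0.3 * sat(x2))

def shared(x1, x2):
    # Channels compete for one saturating response.
    return 100.0 + 80.0 * sat(x1 + x2)

def gap(model, x1, x2):
    """Sum of one-at-a-time contributions minus the all-channels-off difference."""
    y = model(x1, x2)
    one_at_a_time = (y - model(0.0, x2)) + (y - model(x1, 0.0))
    all_off = y - model(0.0, 0.0)
    return one_at_a_time - all_off

for model in (additive, multiplicative, shared):
    print(f"{model.__name__:15s} gap = {gap(model, 1.0, 1.0):+.2f}")
```

Under these assumed forms the gap is zero for the additive model, positive for the multiplicative one, and negative for the shared-saturation one.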
Specification Shopping and Causal Inference
Specification shopping—trying multiple model specifications and selecting the one that gives "reasonable" results—is incompatible with causal inference.
Why Specification Shopping Breaks Causal Inference
Causal identification relies on a priori specification of the causal graph and functional forms. When you:
- Remove channels with "wrong sign" coefficients
- Add controls until results "look right"
- Try different functional forms until ROIs are "reasonable"
...you invalidate all statistical inference. The reported uncertainty no longer reflects true uncertainty, and "causal" estimates become meaningless.
The Correct Approach
- Specify the DAG first — Based on domain knowledge, not data
- Identify adjustment sets — Use backdoor/frontdoor criteria
- Specify priors — Encode beliefs before seeing results
- Fit once and report — Accept wide uncertainty if that's what the data support
- Iterate scientifically — Model changes must be justified by theory, not by making results look better
The Preregistration Mindset
Treat your analysis plan like a preregistered experiment. Specify the model before looking at coefficient estimates. Document changes and their justifications. Report all models tried, not just the final one.
Key Takeaways
Draw the DAG First
Before any analysis, explicitly draw the assumed causal structure. This clarifies assumptions and identifies what needs to be controlled.
Intervention ≠ Observation
\(P(Y|X)\) and \(P(Y|do(X))\) are different quantities. Never confuse conditional probabilities with causal effects.
Counterfactuals Need Models
Counterfactual questions require structural models, not just data. The functional form matters—encode it thoughtfully.
References
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal Inference in Statistics: A Primer. Wiley.
- Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
- Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. [Free online]
- Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC. [Free online]
- Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference. MIT Press.