Causal Inference for Analysts

A practical guide to causal reasoning using Judea Pearl's framework of structural causal models, do-calculus, and counterfactual analysis—with applications to Marketing Mix Modeling.

Key References

This guide draws primarily from Pearl, Glymour, & Jewell (2016), Causal Inference in Statistics: A Primer; Pearl (2009), Causality; and Pearl & Mackenzie (2018), The Book of Why. For applied estimation methods relevant to MMM, see also Cunningham (2021), Causal Inference: The Mixtape.

Why Causal Inference?

Marketing Mix Modeling asks fundamentally causal questions: What is the effect of TV advertising on sales? If we had spent more on digital, how much more would we have sold? These questions cannot be answered by correlation alone.

Correlation ≠ Causation

Ice cream sales and drowning deaths are correlated. But eating ice cream doesn't cause drowning; both are caused by summer heat. Confusing correlation with causation leads to wrong decisions.

Prediction ≠ Intervention

A model can predict sales perfectly from ad spend historically, yet give completely wrong answers about what happens if we change ad spend. Prediction is about patterns; causation is about mechanisms.

The goal of causal inference is to answer interventional and counterfactual questions from observational data—when experiments are impossible or impractical.

The Fundamental Problem of Causal Inference

The Fundamental Problem

For any individual unit at any moment in time, we can observe at most one potential outcome: the outcome under the treatment actually received. The outcome under the alternative treatment is forever unobservable. This missing data problem is the fundamental challenge of causal inference.

Consider a single week where we spent 100K USD on TV. We observed sales of 1M USD. What would sales have been if we had spent 0 USD on TV that week? We can never observe this directly, because we can't rewind time and run the alternative scenario.

$$\text{Causal Effect} = Y^{a=1} - Y^{a=0}$$ (individual effect)

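Where \(Y^{a=1}\) is the potential outcome under treatment and \(Y^{a=0}\) is the potential outcome under control. We observe one; the other is the counterfactual. Pearl's notation, used in the later sections of this guide, writes the same potential outcomes as \(Y_x\).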
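To make the missing-data framing concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the function names `simulate_units` and `observe`, the outcome scale, and the constant effect of +20. Only a simulation can store both potential outcomes for a unit; the analyst's view keeps just one of them.

```python
import random

def simulate_units(n, seed=0):
    """Simulate n units with BOTH potential outcomes (possible only in a simulation)."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(100.0, 10.0)   # outcome without treatment, Y^{a=0}
        y1 = y0 + 20.0                # outcome with treatment, Y^{a=1} (assumed effect: +20)
        a = rng.random() < 0.5        # treatment assigned by a coin flip (assumption)
        units.append({"a": a, "y0": y0, "y1": y1})
    return units

def observe(unit):
    """Return what an analyst actually sees: the treatment and ONE outcome."""
    return {"a": unit["a"], "y": unit["y1"] if unit["a"] else unit["y0"]}

units = simulate_units(10_000)
observed = [observe(u) for u in units]

# Individual effects need both potential outcomes, which real data never contain.
true_ate = sum(u["y1"] - u["y0"] for u in units) / len(units)

# Because treatment was randomized here, a difference in observed group means
# estimates the average effect even though no individual effect is observed.
treated = [o["y"] for o in observed if o["a"]]
control = [o["y"] for o in observed if not o["a"]]
estimated_ate = sum(treated) / len(treated) - sum(control) / len(control)
```

The estimate works in this sketch only because treatment is a coin flip; with observational data, the identification tools described later in this guide are needed.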

Structural Causal Models

A Structural Causal Model (SCM) is a mathematical object that encodes causal relationships. It consists of three components:

Definition: Structural Causal Model

An SCM \(\mathcal{M} = \langle U, V, F \rangle\) consists of:

  • U: Exogenous variables (external, not caused by anything in the model)
  • V: Endogenous variables (determined by variables in the model)
  • F: Structural equations \(v_i = f_i(\text{pa}_i, u_i)\) for each \(v_i \in V\)

Example: Simple Marketing SCM

$$\begin{aligned} \text{Budget} &= U_{\text{budget}} \\ \text{AdSpend} &= f_1(\text{Budget}, U_{\text{ad}}) \\ \text{Awareness} &= f_2(\text{AdSpend}, U_{\text{aware}}) \\ \text{Sales} &= f_3(\text{Awareness}, \text{AdSpend}, U_{\text{sales}}) \end{aligned}$$

The structural equations are asymmetric: \(\text{AdSpend} = f(\text{Budget})\) means Budget causes AdSpend, not the reverse. This asymmetry encodes the direction of causation.
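A hedged sketch of how this SCM might look in Python follows. The linear forms, coefficients, and noise distributions are invented for illustration (the guide leaves \(f_1, f_2, f_3\) unspecified), and the name `sample_marketing_scm` is hypothetical.

```python
import random

def sample_marketing_scm(rng, do_ad_spend=None):
    """Draw one sample from a toy marketing SCM; optionally intervene on AdSpend."""
    # U: exogenous noise terms, drawn independently (assumed distributions)
    u_budget = rng.uniform(50.0, 150.0)
    u_ad = rng.gauss(0.0, 5.0)
    u_aware = rng.gauss(0.0, 2.0)
    u_sales = rng.gauss(0.0, 10.0)

    # F: structural equations, evaluated in causal order (assumed linear forms)
    budget = u_budget
    if do_ad_spend is None:
        ad_spend = 0.6 * budget + u_ad      # f_1: AdSpend listens to Budget
    else:
        ad_spend = do_ad_spend              # intervention overrides f_1
    awareness = 0.5 * ad_spend + u_aware    # f_2
    sales = 200.0 + 2.0 * awareness + 1.0 * ad_spend + u_sales  # f_3
    return {"Budget": budget, "AdSpend": ad_spend,
            "Awareness": awareness, "Sales": sales}

rng = random.Random(42)
sample = sample_marketing_scm(rng)                    # natural behaviour
forced = sample_marketing_scm(rng, do_ad_spend=0.0)   # AdSpend held at zero
```

The asymmetry shows up in the code: assignments flow from causes to effects, and forcing `AdSpend` changes `Awareness` and `Sales` downstream but never changes `Budget`.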

Directed Acyclic Graphs (DAGs)

Every SCM implies a Directed Acyclic Graph (DAG) where nodes are variables and edges point from causes to effects. DAGs provide a visual language for causal reasoning.

graph LR
    B[Budget] --> A[Ad Spend]
    A --> W[Awareness]
    A --> S[Sales]
    W --> S
    style B fill:#f0f7e6,stroke:#6d8a4a
    style A fill:#e6f0f7,stroke:#4a6d8a
    style W fill:#f7f0e6,stroke:#8a6d4a
    style S fill:#f0e6f7,stroke:#6d4a8a

Budget → Ad Spend → Awareness → Sales, with a direct path Ad Spend → Sales

d-Separation

d-separation is a graphical criterion for reading conditional independence from a DAG. Two variables are d-separated given a set Z if every path between them is "blocked" by Z. Variables that are d-separated given Z are conditionally independent given Z in every distribution generated by the DAG.

Path Blocking Rules

A path is blocked by conditioning set Z if:

  1. The path contains a chain A → B → C or fork A ← B → C, and B ∈ Z
  2. The path contains a collider A → B ← C, and B ∉ Z (and no descendant of B is in Z)
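The blocking rules can be checked mechanically. The sketch below uses a standard equivalent test rather than enumerating paths: restrict the graph to the ancestors of the variables involved, "moralize" it (link parents that share a child, then drop arrow directions), delete Z, and see whether the two variables are still connected. The dict-of-parents representation and the function names are assumptions made for this example.

```python
from itertools import combinations

def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

def d_separated(x, y, z, parents):
    """True if x and y are d-separated given the set z (x and y not in z).

    `parents` maps each node to a list of its direct causes.
    """
    z = set(z)
    keep = ancestors_of({x, y} | z, parents)
    # Moralize the ancestral graph: undirected child-parent and parent-parent links.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pars = list(parents.get(child, ()))
        for p in pars:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for p, q in combinations(pars, 2):
            neighbors[p].add(q)
            neighbors[q].add(p)
    # Search from x while never stepping onto a conditioned node.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors[node] - z)
    return True

marketing = {"AdSpend": ["Budget"], "Awareness": ["AdSpend"],
             "Sales": ["Awareness", "AdSpend"]}
collider = {"B": ["A", "C"]}

chain_open = not d_separated("Budget", "Sales", set(), marketing)
chain_blocked = d_separated("Budget", "Sales", {"AdSpend"}, marketing)
collider_closed = d_separated("A", "C", set(), collider)
collider_opened = not d_separated("A", "C", {"B"}, collider)
```

All four flags evaluate to `True`: conditioning on Ad Spend blocks every path from Budget to Sales, while conditioning on the collider B opens the path between A and C.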

The Three Causal Building Blocks

Chain (Mediation)

graph LR
    A((A)) --> B((B)) --> C((C))

A causes C through B. Conditioning on B blocks the path from A to C.

Example: Ad → Awareness → Sales

Fork (Confounding)

graph LR
    B((B)) --> A((A))
    B --> C((C))

B causes both A and C, creating spurious correlation. Conditioning on B blocks this.

Example: Season → Ice Cream, Season → Drowning

Collider (Selection)

graph LR
    A((A)) --> B((B))
    C((C)) --> B

A and C both cause B. They're independent, but conditioning on B creates spurious association.

Example: Talent → Hollywood, Beauty → Hollywood

Pearl's Ladder of Causation

Pearl's Ladder of Causation (also called the Causal Hierarchy) describes three levels of causal reasoning, each requiring different information and enabling different queries.

Level 3: Counterfactuals (Imagining)
"What would Y have been if X had been different, given what I observed?"
Query: \(P(Y_x | X', Y')\). Requires full SCM with specific \(U\) values.

Level 2: Intervention (Doing)
"What would happen to Y if I set X to a specific value?"
Query: \(P(Y | do(X=x))\). Requires causal graph structure.

Level 1: Association (Seeing)
"What does observing X tell me about Y?"
Query: \(P(Y | X=x)\). Requires only observational data.

The Hierarchy is Strict

You cannot answer Level 2 questions with Level 1 data alone, nor Level 3 questions with Level 2 information alone—no matter how much data you have. Each level requires additional assumptions (encoded in the causal model) to climb the ladder.

Level 1: Association

Association queries ask about statistical dependencies: How are X and Y related in the data?

$$P(Y=y | X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$$ (conditional probability)

Standard machine learning and predictive modeling operate at Level 1. Given historical data, they learn \(P(Y|X)\)—the distribution of Y given we observe X. This is sufficient for prediction but not for intervention.

Level 2: Intervention

Intervention queries ask what happens when we act: If I set X to x, what happens to Y?

$$P(Y=y | do(X=x))$$ (interventional distribution)

The \(do(\cdot)\) operator represents an intervention—physically setting a variable to a value, rather than passively observing it. This is fundamentally different from conditioning.
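The gap between conditioning and intervening can be seen in a small simulation. This is a sketch under invented assumptions: a binary confounder Z that raises both the chance of treatment and the chance of a good outcome, with made-up probabilities, and a hypothetical helper named `draw`.

```python
import random

def draw(rng, do_x=None):
    """Sample (x, y) from a confounded binary system; do_x forces the treatment."""
    z = 1 if rng.random() < 0.5 else 0                     # confounder
    x_natural = 1 if rng.random() < (0.8 if z else 0.2) else 0
    x = x_natural if do_x is None else do_x                # do(X) ignores Z
    y = 1 if rng.random() < 0.1 + 0.2 * x + 0.5 * z else 0
    return x, y

rng = random.Random(0)
n = 50_000

# Seeing: keep only the units that happened to have X = 1.
observational = [draw(rng) for _ in range(n)]
y_when_x1 = [y for x, y in observational if x == 1]
p_seeing = sum(y_when_x1) / len(y_when_x1)      # estimates P(Y=1 | X=1), about 0.70

# Doing: force X = 1 for everyone, regardless of Z.
interventional = [draw(rng, do_x=1) for _ in range(n)]
p_doing = sum(y for _, y in interventional) / n  # estimates P(Y=1 | do(X=1)), about 0.55
```

Under these assumed probabilities the exact values are 0.70 and 0.55. Units observed with X = 1 are mostly high-Z units, so conditioning mixes the effect of X with the effect of Z; forcing X breaks that link.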

Level 3: Counterfactuals

Counterfactual queries ask about alternative histories: Given what actually happened, what would have happened under different circumstances?

$$P(Y_{X=x'} = y' | X=x, Y=y)$$ (counterfactual probability)

This asks: "For a unit that actually received X=x and had outcome Y=y, what would Y have been if X had been x' instead?" This is the language of individual causal effects.

MMM Counterfactuals

"Given that we spent 100K USD on TV and observed 1M USD in sales, how much would we have sold if we had spent 0 USD on TV?" This is a counterfactual question—it's about a specific week that already happened.

The do-Operator

The \(do(\cdot)\) operator formalizes intervention. When we write \(do(X=x)\), we mean:

  1. Delete all arrows pointing into X in the causal graph
  2. Set X to the value x
  3. Compute the resulting distribution over other variables

This "graph surgery" reflects what happens in an experiment: we override the natural causes of X and force it to take a specific value.
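As a small illustration, graph surgery on a DAG stored as a dict of parent lists might look like the sketch below. The representation and the name `do_surgery` are assumptions for this example, not part of any particular library.

```python
def do_surgery(parents, target):
    """Return a copy of the graph with every arrow INTO `target` removed."""
    mutilated = {node: list(pars) for node, pars in parents.items()}
    mutilated[target] = []   # the intervention replaces X's natural causes
    return mutilated

# Confounded graph: Z -> X, Z -> Y, X -> Y (each node maps to its parents)
observational = {"Z": [], "X": ["Z"], "Y": ["Z", "X"]}
interventional = do_surgery(observational, "X")
```

After surgery, X has no parents while the arrows out of X are untouched, so Z still affects Y but can no longer affect X.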

Observational

graph TB
    Z[Confounder Z] --> X[Treatment X]
    Z --> Y[Outcome Y]
    X --> Y
    style Z fill:#f7e6e6,stroke:#8a4a4a

P(Y|X) confounded by Z

Interventional

graph TB
    Z[Confounder Z] --> Y[Outcome Y]
    X[do X = x] --> Y
    style X fill:#e6f7e6,stroke:#4a8a4a
    style Z fill:#f0f0f0,stroke:#999

P(Y|do(X)) – Z no longer confounds

The Three Rules of do-Calculus

Pearl's do-calculus provides three rules for manipulating expressions containing \(do(\cdot)\). Together, they are complete: any identifiable causal effect can be computed using these rules.

Rule 1: Insertion/Deletion of Observations

$$P(Y | do(X), Z, W) = P(Y | do(X), W)$$

Condition: \(Y \perp\!\!\!\perp Z | X, W\) in the graph \(G_{\overline{X}}\) (the graph with all arrows into X deleted).

Intuition: If Z doesn't affect Y once we've intervened on X and conditioned on W, we can ignore Z.

Rule 2: Action/Observation Exchange

$$P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)$$

Condition: \(Y \perp\!\!\!\perp Z | X, W\) in the graph \(G_{\overline{X}, \underline{Z}}\) (arrows into X deleted, arrows out of Z deleted).

Intuition: If intervening on Z has the same effect as observing Z (given the other conditions), we can replace \(do(Z)\) with observation.

Rule 3: Insertion/Deletion of Actions

$$P(Y | do(X), do(Z), W) = P(Y | do(X), W)$$

Condition: \(Y \perp\!\!\!\perp Z | X, W\) in the graph \(G_{\overline{X}, \overline{Z(W)}}\) where \(Z(W)\) is the set of Z-nodes not ancestors of any W-node in \(G_{\overline{X}}\).

Intuition: If Z has no effect on Y given our other interventions and observations, we can drop \(do(Z)\).
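As a worked example of how the rules combine, consider the three-node confounded graph Z → X, Z → Y, X → Y, with Z observed. The derivation below is a sketch for that graph only; note that the rules are applied with the roles of the symbols renamed (the treatment X plays the part of "Z" in Rules 2 and 3).

```latex
\begin{aligned}
P(y \mid do(x))
  &= \sum_z P(y \mid do(x), z)\, P(z \mid do(x))
     && \text{(law of total probability)} \\
  &= \sum_z P(y \mid do(x), z)\, P(z)
     && \text{(Rule 3: } Z \perp\!\!\!\perp X \text{ in } G_{\overline{X}}\text{)} \\
  &= \sum_z P(y \mid x, z)\, P(z)
     && \text{(Rule 2: } Y \perp\!\!\!\perp X \mid Z \text{ in } G_{\underline{X}}\text{)}
\end{aligned}
```

Rule 3 applies because, with arrows into X deleted, the only path between X and Z runs through the collider Y. Rule 2 applies because, with arrows out of X deleted, the only path between X and Y is the fork through Z, which conditioning on Z blocks. The final line contains no \(do(\cdot)\) terms, so it can be estimated from observational data.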

Identification

A causal effect is identifiable if it can be computed from observational data plus the causal graph structure, without knowing the functional forms \(f_i\).

Identifiability

The causal effect \(P(Y|do(X))\) is identifiable if it can be expressed as a function of the observational distribution \(P(V)\) alone. If it cannot, no amount of observational data will reveal the causal effect; an experiment, or additional assumptions beyond the graph, is required.

The Backdoor Criterion

The backdoor criterion provides a simple sufficient condition for identification.

Backdoor Criterion

A set of variables Z satisfies the backdoor criterion relative to (X, Y) if:

  1. No node in Z is a descendant of X
  2. Z blocks all backdoor paths from X to Y (paths with an arrow into X)

If Z satisfies the backdoor criterion, the causal effect is identified by the backdoor adjustment formula:

$$P(Y | do(X)) = \sum_z P(Y | X, Z=z) \cdot P(Z=z)$$ (backdoor adjustment)
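The adjustment formula can be checked exactly on a small discrete example. The sketch below assumes a binary graph Z → X, Z → Y, X → Y with invented probability tables; the helper names (`bern`, `prob`, `naive`, `backdoor`) are illustrative. Both estimators use only the observational joint distribution.

```python
from itertools import product

# Hypothetical binary system: Z -> X, Z -> Y, X -> Y
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.6, (1, 1): 0.8}

def bern(p, value):
    """Probability that a Bernoulli(p) variable equals `value`."""
    return p if value == 1 else 1.0 - p

# Observational joint P(z, x, y): everything an analyst could estimate from data.
joint = {
    (z, x, y): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_xz[(x, z)], y)
    for z, x, y in product((0, 1), repeat=3)
}

def prob(**fixed):
    """Probability of the event given by keyword arguments z, x, y."""
    return sum(p for (z, x, y), p in joint.items()
               if all({"z": z, "x": x, "y": y}[k] == v for k, v in fixed.items()))

def naive(x):
    """P(Y=1 | X=x): plain conditioning."""
    return prob(x=x, y=1) / prob(x=x)

def backdoor(x):
    """sum_z P(Y=1 | X=x, Z=z) P(Z=z): the backdoor adjustment."""
    return sum(prob(z=z, x=x, y=1) / prob(z=z, x=x) * prob(z=z) for z in (0, 1))

naive_effect = naive(1) - naive(0)           # 0.70 - 0.20 = 0.50
adjusted_effect = backdoor(1) - backdoor(0)  # 0.55 - 0.35 = 0.20
```

In this toy system the structural effect of X on P(Y=1) is 0.2 by construction, which the adjusted estimate recovers; plain conditioning reports 0.5 because high-Z units are over-represented among the treated.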

The Frontdoor Criterion

When the backdoor criterion fails (no valid adjustment set exists), the frontdoor criterion may still identify the effect through a mediator.

Frontdoor Criterion

A set of variables M satisfies the frontdoor criterion relative to (X, Y) if:

  1. M intercepts all directed paths from X to Y
  2. There is no unblocked backdoor path from X to M
  3. All backdoor paths from M to Y are blocked by X

graph LR
    U[Unobserved U] -.-> X[Treatment X]
    U -.-> Y[Outcome Y]
    X --> M[Mediator M]
    M --> Y
    style U fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5

U confounds X→Y, but M satisfies the frontdoor criterion

$$P(Y | do(X)) = \sum_m P(M=m | X) \sum_{x'} P(Y | M=m, X=x') P(X=x')$$ (frontdoor formula)
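The frontdoor formula can also be verified exactly on a toy model. The sketch below assumes binary variables and invented probability tables for the graph U → X, U → Y, X → M, M → Y. The formula is evaluated from the joint distribution of (X, M, Y) only, with U summed out, and then compared with the ground truth that uses U.

```python
from itertools import product

# Hypothetical tables; U is unobserved by the analyst.
p_u1 = 0.5
p_x1_given_u = {0: 0.2, 1: 0.8}
p_m1_given_x = {0: 0.1, 1: 0.9}
p_y1_given_mu = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.9}

def bern(p, value):
    """Probability that a Bernoulli(p) variable equals `value`."""
    return p if value == 1 else 1.0 - p

# Observed joint P(x, m, y), obtained by summing over U.
observed = {}
for u, x, m, y in product((0, 1), repeat=4):
    p = (bern(p_u1, u) * bern(p_x1_given_u[u], x)
         * bern(p_m1_given_x[x], m) * bern(p_y1_given_mu[(m, u)], y))
    observed[(x, m, y)] = observed.get((x, m, y), 0.0) + p

def prob(x=None, m=None, y=None):
    """Probability of an event over (x, m, y); None means 'any value'."""
    return sum(p for (xi, mi, yi), p in observed.items()
               if x in (None, xi) and m in (None, mi) and y in (None, yi))

def frontdoor(x):
    """sum_m P(m | x) sum_x' P(Y=1 | m, x') P(x'), using observed data only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = prob(x=x, m=m) / prob(x=x)
        inner = sum(prob(x=xp, m=m, y=1) / prob(x=xp, m=m) * prob(x=xp)
                    for xp in (0, 1))
        total += p_m_given_x * inner
    return total

def truth(x):
    """P(Y=1 | do(X=x)) computed from the full model, including U."""
    return sum(bern(p_m1_given_x[x], m)
               * sum(bern(p_u1, u) * p_y1_given_mu[(m, u)] for u in (0, 1))
               for m in (0, 1))

naive_x1 = prob(x=1, y=1) / prob(x=1)   # 0.78, biased by U
frontdoor_x1 = frontdoor(1)             # 0.66, matches truth(1)
```

Under these assumed tables the frontdoor estimate equals the true interventional probability (0.66 for X = 1, 0.34 for X = 0) even though U never enters the calculation.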

Instrumental Variables

An instrumental variable Z affects X, has no effect on Y except through X, and shares no unobserved common cause with Y. It provides identification when direct backdoor adjustment is impossible.

graph LR
    Z[Instrument Z] --> X[Treatment X]
    U[Unobserved U] -.-> X
    U -.-> Y[Outcome Y]
    X --> Y
    style U fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5
    style Z fill:#e6f7e6,stroke:#4a8a4a

For linear models, the IV estimate is:

$$\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}$$ (IV estimator)
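A quick simulation shows the estimator at work. This is a sketch under assumed linear structural equations with a true effect of 2.0 and an unobserved confounder U; all coefficients are invented, and `cov` is a small helper written for this example.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(0)
true_beta = 2.0
z, x, y = [], [], []
for _ in range(20_000):
    zi = rng.gauss(0.0, 1.0)                 # instrument: independent of U
    ui = rng.gauss(0.0, 1.0)                 # unobserved confounder
    xi = zi + ui + rng.gauss(0.0, 1.0)       # treatment depends on Z and U
    yi = true_beta * xi + 3.0 * ui + rng.gauss(0.0, 1.0)  # Z enters only via X
    z.append(zi)
    x.append(xi)
    y.append(yi)

beta_ols = cov(x, y) / cov(x, x)   # naive regression slope, biased by U (near 3.0)
beta_iv = cov(z, y) / cov(z, x)    # IV estimate, close to the true 2.0
```

The naive slope absorbs the confounder's contribution, while the IV ratio uses only the part of the variation in X that is driven by Z.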

Computing Counterfactuals

Counterfactual reasoning requires three steps:

1. Abduction

Given the evidence (what we observed), infer the values of exogenous variables U that are consistent with the observation.

2. Action

Modify the structural equations to reflect the hypothetical intervention (set X to the counterfactual value).

3. Prediction

Use the modified model with the inferred U values to compute the counterfactual outcome.

Example: Counterfactual Sales

Consider a simple linear SCM:

$$\begin{aligned} X &= U_X \\ Y &= \beta X + U_Y \end{aligned}$$

Observation: We observed X=100 (spent 100K USD) and Y=500 (500K USD sales).

Question: What would Y have been if X had been 0?

Step 1 (Abduction): From \(Y = \beta X + U_Y\) and the observation, infer \(U_Y = 500 - 100\beta\).

Step 2 (Action): Set \(X = 0\) in the modified model.

Step 3 (Prediction): \(Y_{X=0} = \beta(0) + U_Y = 500 - 100\beta\).

If \(\beta = 3\) (each 1K USD in ad spend generates 3K USD in sales), then \(Y_{X=0} = 500 - 300 = 200\), that is, 200K USD.
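The three steps translate directly into code. The sketch below assumes the linear SCM of this example with a known \(\beta\); the function name `counterfactual_y` is hypothetical.

```python
def counterfactual_y(x_obs, y_obs, x_cf, beta):
    """Counterfactual outcome for one unit in the linear SCM Y = beta * X + U_Y."""
    # Step 1 (abduction): recover this unit's noise term from what was observed.
    u_y = y_obs - beta * x_obs
    # Step 2 (action): replace the equation for X with the constant x_cf.
    x = x_cf
    # Step 3 (prediction): rerun the outcome equation with the same u_y.
    return beta * x + u_y

# Observed: 100K USD spend and 500K USD sales; assumed beta = 3.
y_if_no_spend = counterfactual_y(x_obs=100.0, y_obs=500.0, x_cf=0.0, beta=3.0)
```

With these inputs the function returns 200.0, matching the hand calculation. In practice \(\beta\) is estimated, so the counterfactual inherits its uncertainty.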

The Role of the Structural Model

Counterfactual computation requires knowing (or estimating) the structural equations, not just the DAG. The functional form matters—linear models give different counterfactuals than nonlinear models.

Common Pitfalls

Confounding

Confounding occurs when a common cause of treatment and outcome creates a spurious association.

graph TB
    Z[Seasonality] --> X[Ad Spend]
    Z --> Y[Sales]
    X --> Y
    style Z fill:#f7e6e6,stroke:#8a4a4a

Seasonality increases both ad spend (holiday campaigns) and sales (holiday demand), inflating the apparent effect of ads.

Solution: Control for confounders via backdoor adjustment, or use experimental designs.

Collider Bias

Collider bias (selection bias, Berkson's paradox) occurs when you condition on a common effect of two variables, creating a spurious association between them.

graph TB
    X[Ad Quality] --> C[Brand Success]
    Y[Market Luck] --> C
    style C fill:#f7e6e6,stroke:#8a4a4a

Among successful brands (conditioning on C), ad quality and luck appear negatively correlated, even though they are actually independent.

Never Condition on Colliders

Conditioning on a collider (or its descendants) opens a spurious path between its causes. This is why controlling for "everything" can make estimates worse, not better.
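A short simulation makes the effect visible. The setup is an assumption for illustration: ad quality and luck are independent standard normal scores, and a brand counts as "successful" when their sum exceeds 1.0. The `corr` helper is written for this example.

```python
import random

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov_ab / (var_a * var_b) ** 0.5

rng = random.Random(7)
# Each brand: (ad quality, market luck), drawn independently.
brands = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20_000)]

# Conditioning on the collider: keep only brands whose success passes a threshold.
successful = [(q, l) for q, l in brands if q + l > 1.0]

corr_all = corr([q for q, _ in brands], [l for _, l in brands])
corr_successful = corr([q for q, _ in successful], [l for _, l in successful])
```

Across all brands the correlation is close to zero, while among successful brands it is clearly negative: a successful brand with weak ads must have been lucky, and vice versa.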

Overcontrol Bias

Overcontrol bias occurs when you adjust for a mediator on the causal pathway, blocking part of the effect you're trying to measure.

graph LR
    X[TV Ads] --> M[Brand Awareness]
    M --> Y[Sales]
    X --> Y
    style M fill:#f7f0e6,stroke:#8a6d4a

If you control for Awareness, you block the X→M→Y path, underestimating TV's total effect.

Rule: Only control for confounders, never mediators (unless you specifically want the direct effect).
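The size of the bias can be illustrated with a simulated linear system. In this sketch the coefficients are invented: a direct effect of 1.0, X → M of 0.8, and M → Y of 2.0, giving a total effect of 1.0 + 0.8 × 2.0 = 2.6. The two-regressor coefficient uses the standard closed-form OLS expression.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

rng = random.Random(1)
direct, x_to_m, m_to_y = 1.0, 0.8, 2.0       # assumed structural coefficients
total_effect = direct + x_to_m * m_to_y      # 2.6

x, m, y = [], [], []
for _ in range(20_000):
    xi = rng.gauss(0.0, 1.0)                              # TV ads
    mi = x_to_m * xi + rng.gauss(0.0, 1.0)                # brand awareness
    yi = direct * xi + m_to_y * mi + rng.gauss(0.0, 1.0)  # sales
    x.append(xi)
    m.append(mi)
    y.append(yi)

# Regress Y on X alone: recovers the total effect (near 2.6).
slope_unadjusted = cov(x, y) / cov(x, x)

# Coefficient on X when M is also a regressor: only the direct effect (near 1.0).
det = cov(x, x) * cov(m, m) - cov(x, m) ** 2
slope_adjusted = (cov(x, y) * cov(m, m) - cov(m, y) * cov(x, m)) / det
```

Adding the mediator as a control removes the 1.6 that flows through awareness, so the adjusted coefficient answers a different question (the direct effect) than the one usually asked of an MMM (the total effect).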

MMM as a DAG

A Marketing Mix Model encodes specific causal assumptions. Here's the DAG implied by a standard MMM:

graph TB
    subgraph Confounders
        S[Seasonality]
        T[Trend]
        E[Economy]
        C[Competitor Activity]
    end
    subgraph Media
        TV[TV Spend]
        DIG[Digital Spend]
        OOH[OOH Spend]
    end
    Budget[Marketing Budget] --> TV
    Budget --> DIG
    Budget --> OOH
    S --> TV
    S --> Y[Sales]
    T --> Y
    E --> Y
    E --> Budget
    C --> Y
    TV --> Y
    DIG --> Y
    OOH --> Y
    style Y fill:#f0e6f7,stroke:#6d4a8a
    style Budget fill:#e6f0f7,stroke:#4a6d8a

Media Effect Identification

For media effects to be identified in an MMM, we need to satisfy the backdoor criterion by controlling for all common causes of media spend and sales.

| Potential Confounder | Effect on Spend | Effect on Sales | Action |
|---|---|---|---|
| Seasonality | Holiday campaigns | Holiday demand | Include seasonal controls |
| Trend | Growing budgets | Market growth | Include trend component |
| Promotions | Often co-occur with ads | Direct sales lift | Include promotion indicator |
| Competitor activity | Defensive spending | Market share shifts | Include if observable |

National Media, Geo-Level "Effects"

If media spend is national (same value for all geos), you cannot identify geo-level differential effects from observational data. Any apparent geo-level variation reflects correlation with timing patterns, not causal heterogeneity. See the Hierarchical Model section for details.

Counterfactual Contribution Analysis

MMM contribution analysis is fundamentally counterfactual: "How much of observed sales is attributable to each channel?" This requires computing:

$$\text{Contribution}_c = Y_{\text{observed}} - Y_{do(X_c = 0)}$$ (channel contribution)

This is the difference between actual sales and the counterfactual sales if channel c had been set to zero. The MMM Framework implements this via:

# Counterfactual contribution in the framework
contributions = model.compute_contributions(
    method="counterfactual",  # vs "marginal" decomposition
    baseline="zero"           # counterfactual: set channel to zero
)

# For each posterior draw, compute Y - Y_{do(X_c=0)}
# Returns full posterior distribution of contributions

Contributions Sum to More Than Total

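When nonlinear effects make channels interact (for example, in a multiplicative or log-linear specification), the sum of individual counterfactual contributions typically exceeds total sales minus baseline. This is because removing all channels simultaneously is different from removing each one at a time: each one-at-a-time contribution is measured with the other channels still active. In a purely additive model, where each channel has its own saturation curve and there are no interactions, the contributions sum exactly. A mismatch under interacting channels is therefore expected and is not a bug.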
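The arithmetic is easy to check on toy response functions. The sketch below is not the framework's model: the saturation curve, the coefficients, and both sales functions are invented for illustration. It compares a multiplicative specification, where channels interact, with an additive one, where they do not.

```python
import math

def saturate(spend, half_saturation):
    """Simple saturating response in [0, 1) (assumed functional form)."""
    return spend / (spend + half_saturation)

def sales_multiplicative(tv, digital):
    """Log-linear toy model: channel effects multiply."""
    return 100.0 * math.exp(0.5 * saturate(tv, 50.0) + 0.3 * saturate(digital, 20.0))

def sales_additive(tv, digital):
    """Additive toy model: each channel adds its own saturating lift."""
    return 100.0 + 50.0 * saturate(tv, 50.0) + 30.0 * saturate(digital, 20.0)

def contribution_gap(sales, tv, digital):
    """Sum of one-at-a-time contributions minus (observed - baseline)."""
    observed = sales(tv, digital)
    baseline = sales(0.0, 0.0)
    contrib_tv = observed - sales(0.0, digital)    # Y - Y_{do(TV = 0)}
    contrib_digital = observed - sales(tv, 0.0)    # Y - Y_{do(Digital = 0)}
    return (contrib_tv + contrib_digital) - (observed - baseline)

gap_multiplicative = contribution_gap(sales_multiplicative, 100.0, 40.0)  # positive
gap_additive = contribution_gap(sales_additive, 100.0, 40.0)              # zero up to rounding
```

In the multiplicative model the contributions overshoot by roughly 8.8 sales units here, because each channel's removal is evaluated while the other is still lifting sales. In the additive model the gap vanishes.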

Specification Shopping and Causal Inference

Specification shopping—trying multiple model specifications and selecting the one that gives "reasonable" results—is incompatible with causal inference.

Why Specification Shopping Breaks Causal Inference

Causal identification relies on a priori specification of the causal graph and functional forms. When you:

  • Remove channels with "wrong sign" coefficients
  • Add controls until results "look right"
  • Try different functional forms until ROIs are "reasonable"

...you invalidate all statistical inference. The reported uncertainty no longer reflects true uncertainty, and "causal" estimates become meaningless.

The Correct Approach

  1. Specify the DAG first — Based on domain knowledge, not data
  2. Identify adjustment sets — Use backdoor/frontdoor criteria
  3. Specify priors — Encode beliefs before seeing results
  4. Fit once and report — Accept wide uncertainty if that's what the data support
  5. Iterate scientifically — Model changes must be justified by theory, not by making results look better

The Preregistration Mindset

Treat your analysis plan like a preregistered experiment. Specify the model before looking at coefficient estimates. Document changes and their justifications. Report all models tried, not just the final one.

Key Takeaways

Draw the DAG First

Before any analysis, explicitly draw the assumed causal structure. This clarifies assumptions and identifies what needs to be controlled.

Intervention ≠ Observation

\(P(Y|X)\) and \(P(Y|do(X))\) are different quantities. Never confuse conditional probabilities with causal effects.

Counterfactuals Need Models

Counterfactual questions require structural models, not just data. The functional form matters—encode it thoughtfully.

References