Causal Inference for Analysts

A practical guide to causal reasoning using Judea Pearl's framework of structural causal models, do-calculus, and counterfactual analysis—with applications to Marketing Mix Modeling.

Key References

This guide draws primarily from Pearl, Glymour, & Jewell (2016) Causal Inference in Statistics: A Primer, Pearl (2009) Causality, and Pearl & Mackenzie (2018) The Book of Why. For applied methods relevant to MMM, see also Cunningham (2021) Causal Inference: The Mixtape.

🪜 Prefer to learn it hands-on?

The eleven-part Causal Inference in Practice series walks every idea on this page through executable notebooks — confounded worlds with sealed answer keys, structural mediation, latent confounders, experiment calibration, and a closed measurement loop that converges on the truth.

📜 The identification contract

Everything this guide teaches comes due as assumptions. The framework states all seven in one place — formally, labeled testable or untestable, priced against the pressure-testing scorecard: the identification contract.

Why Causal Inference?

Marketing Mix Modeling asks fundamentally causal questions: What is the effect of TV advertising on sales? If we had spent more on digital, how much more would we have sold? These questions cannot be answered by correlation alone.

Correlation ≠ Causation

Ice cream sales and drowning deaths are correlated. But eating ice cream doesn't cause drowning; both are caused by summer heat. Confusing correlation with causation leads to wrong decisions.

Prediction ≠ Intervention

A model can predict sales perfectly from ad spend historically, yet give completely wrong answers about what happens if we change ad spend. Prediction is about patterns; causation is about mechanisms.

The goal of causal inference is to answer interventional and counterfactual questions from observational data—when experiments are impossible or impractical.

The Fundamental Problem of Causal Inference

The Fundamental Problem

For any individual unit at any moment in time, we can observe at most one potential outcome: the outcome under the treatment actually received. The outcome under the alternative treatment is forever unobservable. This missing data problem is the fundamental challenge of causal inference.

Consider a single week where we spent 100K USD on TV. We observed sales of 1M USD. What would sales have been if we had spent 0 USD on TV that week? We can never observe this directly; we can't rewind time and run the alternative scenario.

$$\text{Causal Effect} = Y^{a=1} - Y^{a=0}$$ (individual effect)

Where $Y^{a=1}$ is the potential outcome under treatment and $Y^{a=0}$ is the potential outcome under control. We observe one; the other is the counterfactual.
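A small simulation can make the missing-data view concrete. The sketch below is illustrative only: the sales figures, the effect sizes, and the coin-flip assignment of treatment are assumptions made for the example, not values from this guide. In a simulation both potential outcomes exist, so we can see what reality hides: each week reveals one outcome, yet the average effect is still recoverable when treatment is assigned at random.

```python
import numpy as np

rng = np.random.default_rng(7)
n_weeks = 100_000

# Both potential outcomes exist in a simulation; reality reveals only one.
y_control = rng.normal(1000.0, 50.0, n_weeks)        # Y^{a=0}: sales (K USD) without TV
individual_effect = rng.normal(30.0, 10.0, n_weeks)  # assumed per-week effect of TV
y_treated = y_control + individual_effect            # Y^{a=1}: sales with TV

# Assumption: treatment is assigned by coin flip, so nothing confounds it.
treated = rng.random(n_weeks) < 0.5
y_observed = np.where(treated, y_treated, y_control)  # the only column we would see

true_average_effect = individual_effect.mean()
estimated_average_effect = y_observed[treated].mean() - y_observed[~treated].mean()
print(f"true average effect:      {true_average_effect:.2f}")
print(f"estimated average effect: {estimated_average_effect:.2f}")
```

The individual effects stay hidden in `y_observed`; only their average is estimated, and only because assignment was random.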

Structural Causal Models

A Structural Causal Model (SCM) is a mathematical object that encodes causal relationships. It consists of three components:

Definition: Structural Causal Model

An SCM $\mathcal{M} = \langle U, V, F \rangle$ consists of:

U: Exogenous variables (external, not caused by anything in the model)
V: Endogenous variables (determined by variables in the model)
F: Structural equations $v_i = f_i(\text{pa}_i, u_i)$ for each $v_i \in V$

Example: Simple Marketing SCM

\begin{aligned} \text{Budget} &= U_{\text{budget}} \\ \text{AdSpend} &= f_1(\text{Budget}, U_{\text{ad}}) \\ \text{Awareness} &= f_2(\text{AdSpend}, U_{\text{aware}}) \\ \text{Sales} &= f_3(\text{Awareness}, \text{AdSpend}, U_{\text{sales}}) \end{aligned}

The structural equations are asymmetric: $\text{AdSpend} = f(\text{Budget})$ means Budget causes AdSpend, not the reverse. This asymmetry encodes the direction of causation.
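The asymmetry can be sketched in code by writing each structural equation as an assignment. The linear forms, coefficients, and noise scales below are illustrative assumptions; only the parent structure comes from the example above. Overriding the AdSpend equation changes everything downstream of it (Awareness, Sales) and leaves its cause (Budget) untouched.

```python
import numpy as np

def sample_marketing_scm(n, rng, forced_ad_spend=None):
    """Draw n samples from a toy version of the marketing SCM.

    The linear forms and coefficients are illustrative assumptions.
    Passing forced_ad_spend replaces the AdSpend equation with a constant.
    """
    # Exogenous variables U, one per structural equation
    u_budget = rng.normal(100.0, 10.0, n)
    u_ad = rng.normal(0.0, 2.0, n)
    u_aware = rng.normal(0.0, 1.0, n)
    u_sales = rng.normal(0.0, 5.0, n)

    budget = u_budget
    if forced_ad_spend is None:
        ad_spend = 0.5 * budget + u_ad                 # f_1(Budget, U_ad)
    else:
        ad_spend = np.full(n, float(forced_ad_spend))  # equation overridden
    awareness = 0.2 * ad_spend + u_aware               # f_2(AdSpend, U_aware)
    sales = 3.0 * awareness + 1.0 * ad_spend + u_sales # f_3(Awareness, AdSpend, U_sales)
    return {"budget": budget, "ad_spend": ad_spend,
            "awareness": awareness, "sales": sales}

rng = np.random.default_rng(0)
natural = sample_marketing_scm(50_000, rng)
forced = sample_marketing_scm(50_000, rng, forced_ad_spend=60.0)

for name in ("budget", "ad_spend", "sales"):
    print(f"{name:9s} natural={natural[name].mean():7.2f}  forced={forced[name].mean():7.2f}")
```

Under these assumed coefficients the total effect of AdSpend on Sales is 3.0 × 0.2 + 1.0 = 1.6 per unit, so forcing spend to 60 moves mean sales to roughly 96 while mean Budget stays near 100.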

Directed Acyclic Graphs (DAGs)

Every SCM implies a Directed Acyclic Graph (DAG) where nodes are variables and edges point from causes to effects. DAGs provide a visual language for causal reasoning. In an MMM, a TV spend node sends an edge into sales—but an underlying demand node (holiday season, category growth) sends edges into both spend and sales, which is exactly what makes it a confounder.

graph LR
    B[Budget] --> A[Ad Spend]
    A --> W[Awareness]
    A --> S[Sales]
    W --> S
    style B fill:#f0f7e6,stroke:#6d8a4a
    style A fill:#e6f0f7,stroke:#4a6d8a
    style W fill:#f7f0e6,stroke:#8a6d4a
    style S fill:#f0e6f7,stroke:#6d4a8a

Budget → Ad Spend → Awareness → Sales, with a direct path Ad Spend → Sales

d-Separation

d-separation is a graphical criterion for reading conditional independence from a DAG. Two variables are d-separated given a set Z if every path between them is "blocked" by Z; d-separated variables are conditionally independent given Z in every distribution the graph can generate. In plain terms: the graph tells you which variables can still pass information to each other once you have controlled for a given set, so it tells you what to control for and what to leave alone.

Path Blocking Rules

A path is blocked by conditioning set Z if:

The path contains a chain A → B → C or fork A ← B → C, and B ∈ Z
The path contains a collider A → B ← C, and B ∉ Z (and no descendant of B is in Z)

The Three Causal Building Blocks

Chain (Mediation)

graph LR
    A((A)) --> B((B)) --> C((C))

A causes C through B. Conditioning on B blocks the path from A to C.

Example: Ad → Awareness → Sales

Fork (Confounding)

graph LR
    B((B)) --> A((A))
    B --> C((C))

B causes both A and C, creating spurious correlation. Conditioning on B blocks this.

Example: Season → Ice Cream, Season → Drowning

Collider (Selection)

graph LR
    A((A)) --> B((B))
    C((C)) --> B

A and C both cause B. They're independent, but conditioning on B creates spurious association.

Example: Talent → Hollywood, Beauty → Hollywood
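The three patterns can be checked numerically. The sketch below assumes linear relationships with standard normal noise (an assumption made so that a partial correlation is a valid stand-in for conditional independence); `partial_corr` is a helper defined here for illustration, not a library function. For each structure it reports the correlation between A and C before and after conditioning on B.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
    y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(x_resid, y_resid)[0, 1]

# Chain: A -> B -> C
a = rng.normal(size=n)
b = a + rng.normal(size=n)
c = b + rng.normal(size=n)
chain = (np.corrcoef(a, c)[0, 1], partial_corr(a, c, b))

# Fork: A <- B -> C
b = rng.normal(size=n)
a = b + rng.normal(size=n)
c = b + rng.normal(size=n)
fork = (np.corrcoef(a, c)[0, 1], partial_corr(a, c, b))

# Collider: A -> B <- C
a = rng.normal(size=n)
c = rng.normal(size=n)
b = a + c + rng.normal(size=n)
collider = (np.corrcoef(a, c)[0, 1], partial_corr(a, c, b))

for name, (marginal, given_b) in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(f"{name:9s} corr(A, C) = {marginal:+.2f}   corr(A, C | B) = {given_b:+.2f}")
```

Conditioning on B removes the association in the chain and the fork, and creates one in the collider, where A and C were independent to begin with.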

Pearl's Ladder of Causation

Pearl's Ladder of Causation (also called the Causal Hierarchy) describes three levels of causal reasoning, each requiring different information and enabling different queries.

Level 3

Counterfactuals (Imagining)
"What would Y have been if X had been different, given what I observed?"
Query: $P(Y_x | X', Y')$ — Requires full SCM with specific $U$ values
Marketing: "What would last quarter have looked like without the promo?"

Level 2

Intervention (Doing)
"What would happen to Y if I set X to a specific value?"
Query: $P(Y | do(X=x))$ — Requires causal graph structure
Marketing: "What happens to sales if we double TV spend?"

Level 1

Association (Seeing)
"What does observing X tell me about Y?"
Query: $P(Y | X=x)$ — Requires only observational data
Marketing: "Search spend and sales move together."

The Hierarchy is Strict

You cannot answer Level 2 questions with Level 1 data alone, nor Level 3 questions with Level 2 information alone—no matter how much data you have. Each level requires additional assumptions (encoded in the causal model) to climb the ladder.

Level 1: Association

Association queries ask about statistical dependencies: How are X and Y related in the data?

$$P(Y=y | X=x) = \frac{P(X=x, Y=y)}{P(X=x)}$$ (conditional probability)

Standard machine learning and predictive modeling operate at Level 1. Given historical data, they learn $P(Y|X)$—the distribution of Y given we observe X. This is sufficient for prediction but not for intervention.

Level 2: Intervention

Intervention queries ask what happens when we act: If I set X to x, what happens to Y?

$$P(Y=y | do(X=x))$$ (interventional distribution)

The $do(\cdot)$ operator represents an intervention—physically setting a variable to a value, rather than passively observing it. This is fundamentally different from conditioning.
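The gap between conditioning and intervening can be seen in a toy confounded system. Everything in this sketch is an assumption chosen for illustration: a binary confounder Z raises both the chance of treatment X and the chance of outcome Y, and the true effect of X on Y is set to 0.2. "Seeing" compares the units that happened to receive X; "doing" forces X for everyone.

```python
import numpy as np

def simulate(n, rng, do_x=None):
    """Toy binary system Z -> X, Z -> Y, X -> Y (probabilities are assumptions)."""
    z = rng.random(n) < 0.5
    if do_x is None:
        x = rng.random(n) < 0.2 + 0.6 * z      # X listens to Z
    else:
        x = np.full(n, bool(do_x))             # X is forced; Z no longer matters for X
    y = rng.random(n) < 0.1 + 0.2 * x + 0.5 * z
    return z, x, y

rng = np.random.default_rng(2)
n = 400_000

# Seeing: P(Y=1 | X=1) - P(Y=1 | X=0) in observational data
_, x_obs, y_obs = simulate(n, rng)
seeing = y_obs[x_obs].mean() - y_obs[~x_obs].mean()

# Doing: P(Y=1 | do(X=1)) - P(Y=1 | do(X=0))
_, _, y_do1 = simulate(n, rng, do_x=1)
_, _, y_do0 = simulate(n, rng, do_x=0)
doing = y_do1.mean() - y_do0.mean()

print(f"seeing: {seeing:.3f}")
print(f"doing:  {doing:.3f}")
```

With these assumed probabilities the observational contrast is about 0.5 while the interventional contrast is about 0.2; the difference is entirely due to Z.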

Level 3: Counterfactuals

Counterfactual queries ask about alternative histories: Given what actually happened, what would have happened under different circumstances?

$$P(Y_{X=x'} = y' | X=x, Y=y)$$ (counterfactual probability)

This asks: "For a unit that actually received X=x and had outcome Y=y, what would Y have been if X had been x' instead?" This is the language of individual causal effects.

MMM Counterfactuals

"Given that we spent 100K USD on TV and observed 1M USD in sales, how much would we have sold if we had spent 0 USD on TV?" This is a counterfactual question—it's about a specific week that already happened. A regional holdout test is the closest physical approximation: by keeping media dark in matched regions, you let part of the market live out the counterfactual world the model can otherwise only compute.

The do-Operator

The $do(\cdot)$ operator formalizes intervention. When we write $do(X=x)$, we mean:

Delete all arrows pointing into X in the causal graph
Set X to the value x
Compute the resulting distribution over other variables

This "graph surgery" reflects what happens in an experiment: we override the natural causes of X and force it to take a specific value. In marketing terms, $do(\cdot)$ is the difference between observing the weeks when spend happened to be high—often the very weeks demand was high anyway—and forcing spend high regardless of conditions, the way a geo lift test does.
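Graph surgery itself is a small operation on the graph's parent lists. The sketch below assumes a DAG stored as a dictionary that maps each node to its parents; the representation and the helper name `do` are illustrative choices, not part of any library.

```python
def do(parents, node):
    """Return a copy of the graph with every arrow into `node` deleted."""
    return {child: ([] if child == node else list(pa))
            for child, pa in parents.items()}

# Observational graph: Z -> X, Z -> Y, X -> Y
observational = {"Z": [], "X": ["Z"], "Y": ["Z", "X"]}

# Interventional graph under do(X): the arrow Z -> X is gone
interventional = do(observational, "X")

print(observational)
print(interventional)
```

Only the arrows into X are removed; the arrows out of X and the rest of the graph are left intact, which is why the effect of X on Y survives the surgery.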

Observational

graph TB
    Z[Confounder Z] --> X[Treatment X]
    Z --> Y[Outcome Y]
    X --> Y
    style Z fill:#f7e6e6,stroke:#8a4a4a

P(Y|X) confounded by Z

Interventional

graph TB
    Z[Confounder Z] --> Y[Outcome Y]
    X[do X = x] --> Y
    style X fill:#e6f7e6,stroke:#4a8a4a
    style Z fill:#f0f0f0,stroke:#999

P(Y|do(X)) – Z no longer confounds

The Three Rules of do-Calculus

Pearl's do-calculus provides three rules for manipulating expressions containing $do(\cdot)$. Together, they are complete: any identifiable causal effect can be computed using these rules.

Deep dive: The formal statements and graph conditions of the three rules

Rule 1: Insertion/Deletion of Observations

$$P(Y | do(X), Z, W) = P(Y | do(X), W)$$

Condition: $Y \perp\!\!\!\perp Z | X, W$ in the graph $G_{\overline{X}}$ (the graph with all arrows into X deleted).

Intuition: If Z carries no additional information about Y once we've intervened on X and conditioned on W, we can ignore Z.

Rule 2: Action/Observation Exchange

$$P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)$$

Condition: $Y \perp\!\!\!\perp Z | X, W$ in the graph $G_{\overline{X}, \underline{Z}}$ (arrows into X deleted, arrows out of Z deleted).

Intuition: If intervening on Z has the same effect as observing Z (given the other conditions), we can replace $do(Z)$ with observation.

Rule 3: Insertion/Deletion of Actions

$$P(Y | do(X), do(Z), W) = P(Y | do(X), W)$$

Condition: $Y \perp\!\!\!\perp Z | X, W$ in the graph $G_{\overline{X}, \overline{Z(W)}}$ where $Z(W)$ is the set of Z-nodes not ancestors of any W-node in $G_{\overline{X}}$.

Intuition: If Z has no effect on Y given our other interventions and observations, we can drop $do(Z)$.

Identification

A causal effect is identifiable if it can be computed from observational data plus the causal graph structure, without knowing the functional forms $f_i$.

Identifiability

The causal effect $P(Y|do(X))$ is identifiable if it can be expressed as a function of the observational distribution $P(V)$ alone. If it cannot, no amount of observational data will reveal the causal effect under the graph's assumptions alone; an experiment or additional assumptions (such as an instrument or a parametric restriction) are required.

The Backdoor Criterion

The backdoor criterion provides a simple sufficient condition for identification.

Backdoor Criterion

A set of variables Z satisfies the backdoor criterion relative to (X, Y) if:

No node in Z is a descendant of X
Z blocks all backdoor paths from X to Y (paths with an arrow into X)

If Z satisfies the backdoor criterion, the causal effect is identified by the backdoor adjustment formula:

$$P(Y | do(X)) = \sum_z P(Y | X, Z=z) \cdot P(Z=z)$$ (backdoor adjustment)

This is the formal version of what an MMM does when it controls for demand: measure proxies for the confounder—seasonality, trend, pricing, macro indicators—and adjust for them, so that the remaining spend-to-sales relationship reflects contribution rather than coincidence.
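The adjustment formula can be applied directly to data when Z is discrete. The sketch below simulates a binary system whose probabilities are assumptions chosen for illustration (true effect of X on Y set to 0.2), then compares the naive contrast with the backdoor-adjusted one. The function name `backdoor` is a local helper, not a library call.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Assumed data-generating process: Z -> X, Z -> Y, X -> Y
z = rng.random(n) < 0.5
x = rng.random(n) < 0.2 + 0.6 * z
y = rng.random(n) < 0.1 + 0.2 * x + 0.5 * z

def backdoor(y, x, z, x_value):
    """sum_z P(Y=1 | X=x_value, Z=z) * P(Z=z) for a binary adjustment variable."""
    total = 0.0
    for z_value in (False, True):
        in_stratum = z == z_value
        p_z = in_stratum.mean()                        # P(Z=z)
        p_y = y[in_stratum & (x == x_value)].mean()    # P(Y=1 | X=x, Z=z)
        total += p_y * p_z
    return total

naive = y[x].mean() - y[~x].mean()
adjusted = backdoor(y, x, z, True) - backdoor(y, x, z, False)

print(f"naive difference:    {naive:.3f}")
print(f"backdoor adjustment: {adjusted:.3f}")
```

The naive difference lands near 0.5 and the adjusted one near the assumed true effect of 0.2. The adjustment works here because Z is measured exactly; with noisy proxies for the confounder, some bias would remain.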

The Frontdoor Criterion

When the backdoor criterion fails (no valid adjustment set exists), the frontdoor criterion may still identify the effect through a mediator.

Frontdoor Criterion

A set of variables M satisfies the frontdoor criterion relative to (X, Y) if:

M intercepts all directed paths from X to Y
There is no unblocked backdoor path from X to M
All backdoor paths from M to Y are blocked by X

graph LR
    U[Unobserved U] -.-> X[Treatment X]
    U -.-> Y[Outcome Y]
    X --> M[Mediator M]
    M --> Y
    style U fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5

U confounds X→Y, but M satisfies the frontdoor criterion

$$P(Y | do(X)) = \sum_m P(M=m | X) \sum_{x'} P(Y | M=m, X=x') P(X=x')$$ (frontdoor formula)
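The frontdoor formula can be exercised on simulated data in which the confounder is generated but never shown to the estimator. The probabilities below are assumptions chosen for illustration; under them the true effect of X on Y is 0.4 × 0.7 = 0.28. The helper `frontdoor` is defined here and uses only X, M, and Y.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000

# Assumed data-generating process; U is hidden from the estimator.
u = rng.random(n) < 0.5
x = rng.random(n) < 0.2 + 0.6 * u
m = rng.random(n) < 0.1 + 0.7 * x
y = rng.random(n) < 0.1 + 0.4 * m + 0.4 * u

def frontdoor(x, m, y, x_value):
    """sum_m P(m | x) * sum_x' P(Y=1 | m, x') * P(x') for binary X and M."""
    total = 0.0
    for m_value in (False, True):
        p_m_given_x = (m[x == x_value] == m_value).mean()
        inner = 0.0
        for x_prime in (False, True):
            p_y = y[(m == m_value) & (x == x_prime)].mean()
            inner += p_y * (x == x_prime).mean()
        total += p_m_given_x * inner
    return total

naive = y[x].mean() - y[~x].mean()
frontdoor_effect = frontdoor(x, m, y, True) - frontdoor(x, m, y, False)

print(f"naive difference: {naive:.3f}")
print(f"frontdoor effect: {frontdoor_effect:.3f}")
```

The naive contrast is inflated to roughly 0.52 by the hidden confounder, while the frontdoor estimate recovers roughly 0.28 without ever observing U.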

Instrumental Variables

An instrumental variable Z affects X, has no effect on Y except through X, and shares no unobserved common cause with Y. It provides identification when backdoor adjustment is impossible because the confounder U is unobserved.

graph LR
    Z[Instrument Z] --> X[Treatment X]
    U[Unobserved U] -.-> X
    U -.-> Y[Outcome Y]
    X --> Y
    style U fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5
    style Z fill:#e6f7e6,stroke:#4a8a4a

For linear models, the IV estimate is:

$$\hat{\beta}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, X)}$$ (IV estimator)
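The ratio above takes only a few lines to compute. This sketch assumes a linear system with a hidden confounder U and a true effect of 2.0; all coefficients are illustrative. It compares the ordinary regression slope, which absorbs the confounding, with the IV ratio.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
true_beta = 2.0  # assumed causal effect of X on Y

u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument: independent of U
x = 1.0 * z + 1.0 * u + rng.normal(size=n)   # treatment driven by Z and U
y = true_beta * x + 3.0 * u + rng.normal(size=n)

# Ordinary regression slope Cov(X, Y) / Var(X): biased by U
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV estimator Cov(Z, Y) / Cov(Z, X): uses only variation in X caused by Z
iv_estimate = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS slope:   {ols_slope:.3f}")
print(f"IV estimate: {iv_estimate:.3f}")
```

Under these assumptions the regression slope settles near 3.0 and the IV estimate near the true 2.0. The estimate is only as good as the instrument: a weak Cov(Z, X) makes the ratio unstable.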

Computing Counterfactuals

Counterfactual reasoning requires three steps:

1. Abduction

Given the evidence (what we observed), infer the values of exogenous variables U that are consistent with the observation.

2. Action

Modify the structural equations to reflect the hypothetical intervention (set X to the counterfactual value).

3. Prediction

Use the modified model with the inferred U values to compute the counterfactual outcome.

Deep dive: A worked counterfactual: abduction, action, prediction on a linear SCM

Example: Counterfactual Sales

Consider a simple linear SCM:

\begin{aligned} X &= U_X \\ Y &= \beta X + U_Y \end{aligned}

Observation: We observed X=100 (spent 100K USD) and Y=500 (500K USD sales).

Question: What would Y have been if X had been 0?

Step 1 (Abduction): From Y = βX + U_Y and the observation, infer U_Y = 500 - 100β.

Step 2 (Action): Set X = 0 in the modified model.

Step 3 (Prediction): Y_{X=0} = β(0) + U_Y = 500 - 100β.

If β = 3 (each 1K USD in ad spend generates 3K USD in sales), then Y_{X=0} = 500 - 300 = 200K USD.
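The three steps of this worked example translate directly into code. The function below is a minimal sketch for the linear SCM above only; its name and signature are illustrative.

```python
def counterfactual_y(x_observed, y_observed, x_counterfactual, beta):
    """Counterfactual outcome in the linear SCM Y = beta * X + U_Y."""
    # Step 1 (Abduction): recover the unit's noise term from the evidence
    u_y = y_observed - beta * x_observed
    # Step 2 (Action): replace the equation for X with X = x_counterfactual
    x = x_counterfactual
    # Step 3 (Prediction): push the same U_Y through the modified model
    return beta * x + u_y

y_cf = counterfactual_y(x_observed=100.0, y_observed=500.0,
                        x_counterfactual=0.0, beta=3.0)
print(y_cf)  # 200.0
```

The key move is that `u_y` is carried over unchanged: the counterfactual week keeps everything about the actual week except the spend.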

The Role of the Structural Model

Counterfactual computation requires knowing (or estimating) the structural equations, not just the DAG. The functional form matters—linear models give different counterfactuals than nonlinear models.

Common Pitfalls

Confounding

Confounding occurs when a common cause of treatment and outcome creates a spurious association.

graph TB
    Z[Seasonality] --> X[Ad Spend]
    Z --> Y[Sales]
    X --> Y
    style Z fill:#f7e6e6,stroke:#8a4a4a

Seasonality increases both ad spend (holiday campaigns) and sales (holiday demand), inflating the apparent effect of ads.

Solution: Control for confounders via backdoor adjustment, or use experimental designs.

Collider Bias

Collider bias (selection bias, Berkson's paradox) occurs when you condition on a common effect of two variables, creating a spurious association between them.

graph TB
    X[Ad Quality] --> C[Brand Success]
    Y[Market Luck] --> C
    style C fill:#f7e6e6,stroke:#8a4a4a

Among successful brands (conditioning on C), ad quality and luck appear negatively correlated, but they're actually independent.

Never Condition on Colliders

Conditioning on a collider (or its descendants) opens a spurious path between its causes. This is why controlling for "everything" can make estimates worse, not better.

Overcontrol Bias

Overcontrol bias occurs when you adjust for a mediator on the causal pathway, blocking part of the effect you're trying to measure.

graph LR
    X[TV Ads] --> M[Brand Awareness]
    M --> Y[Sales]
    X --> Y
    style M fill:#f7f0e6,stroke:#8a6d4a

If you control for Awareness, you block the X→M→Y path, underestimating TV's total effect.

Rule: Only control for confounders, never mediators (unless you specifically want the direct effect).
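Overcontrol is easy to reproduce with two regressions. The sketch assumes a linear system in which TV affects sales directly (0.5) and through awareness (0.8 × 1.0), so the total effect is 1.3; the coefficients and the helper `ols_coefs` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Assumed linear system: X -> M -> Y and X -> Y
x = rng.normal(size=n)                        # TV ads
m = 0.8 * x + rng.normal(size=n)              # brand awareness (mediator)
y = 0.5 * x + 1.0 * m + rng.normal(size=n)    # sales

def ols_coefs(y, *regressors):
    """Least-squares slopes of y on the regressors (intercept fitted, not returned)."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1:]

total_effect = ols_coefs(y, x)[0]        # mediator left alone
overcontrolled = ols_coefs(y, x, m)[0]   # mediator adjusted for

print(f"TV coefficient without awareness: {total_effect:.3f}")
print(f"TV coefficient with awareness:    {overcontrolled:.3f}")
```

The first coefficient lands near the total effect of 1.3; the second lands near 0.5, the direct effect only, which is the right answer to a different question.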

MMM as a DAG

A Marketing Mix Model encodes specific causal assumptions. Here's the DAG implied by a standard MMM:

graph TB
    subgraph Confounders
        S[Seasonality]
        T[Trend]
        E[Economy]
        C[Competitor Activity]
    end
    subgraph Media
        TV[TV Spend]
        DIG[Digital Spend]
        OOH[OOH Spend]
    end
    Budget[Marketing Budget] --> TV
    Budget --> DIG
    Budget --> OOH
    S --> TV
    S --> Y[Sales]
    T --> Y
    E --> Y
    E --> Budget
    C --> Y
    TV --> Y
    DIG --> Y
    OOH --> Y
    style Y fill:#f0e6f7,stroke:#6d4a8a
    style Budget fill:#e6f0f7,stroke:#4a6d8a

Media Effect Identification

For media effects to be identified in an MMM, we need to satisfy the backdoor criterion by controlling for all common causes of media spend and sales.

Potential Confounder	Effect on Spend	Effect on Sales	Action
Seasonality	Holiday campaigns	Holiday demand	Include seasonal controls
Trend	Growing budgets	Market growth	Include trend component
Promotions	Often co-occur with ads	Direct sales lift	Include promotion indicator
Competitor activity	Defensive spending	Market share shifts	Include if observable

National Media, Geo-Level "Effects"

If media spend is national (same value for all geos), you cannot identify geo-level differential effects from observational data. Any apparent geo-level variation reflects correlation with timing patterns, not causal heterogeneity. See the Hierarchical Model section for details.

Counterfactual Contribution Analysis

MMM contribution analysis is fundamentally counterfactual: "How much of observed sales is attributable to each channel?" This requires computing:

$$\text{Contribution}_c = Y_{\text{observed}} - Y_{do(X_c = 0)}$$ (channel contribution)

This is the difference between actual sales and the counterfactual sales if channel c had been set to zero. The MMM Framework implements this via:

# Counterfactual contribution in the framework
# (BayesianMMM.compute_counterfactual_contributions)
contributions = model.compute_counterfactual_contributions(
    compute_uncertainty=True,  # propagate the full posterior into HDIs
    hdi_prob=0.9,
    random_seed=42,
)

# For each channel: Y_pred(actual spend) - Y_pred(channel zeroed out),
# computed per posterior draw, so uncertainty carries through
print(contributions.total_contributions)    # total contribution per channel
print(contributions.contribution_hdi_low)   # 90% HDI bounds
print(contributions.contribution_hdi_high)
print(contributions.summary())              # tidy DataFrame of the above

A Convention, Not a Bug: Contributions Need Not Sum to the Total

Be explicit about the convention in play. The counterfactual (zero-out) contribution of a channel answers its own question—what changes if this one channel goes dark?—and with nonlinear (saturating) effects, the sum of these one-at-a-time answers need not equal total sales minus baseline. Removing all channels simultaneously is a different counterfactual from removing each one at a time, so overlap or shortfall between the two is expected and correct, not an error to be reconciled away. The normalized decomposition used elsewhere—for instance the percentage shares in contribution_pct and the stacked decompositions in reports—is a different convention that rescales effects to sum exactly, trading counterfactual interpretability for additivity. Use each for what it is; never mix the two in one table.
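A two-channel toy model shows the gap between the two counterfactuals. The response function below is hypothetical and is not the framework's model: it combines saturating channel lifts multiplicatively, an assumption chosen because it makes the effect easy to see.

```python
import math

def predicted_sales(tv, digital, baseline=100.0):
    """Hypothetical model: saturating lifts per channel, combined multiplicatively."""
    tv_lift = 1.0 + 0.5 * (1.0 - math.exp(-tv / 50.0))
    digital_lift = 1.0 + 0.3 * (1.0 - math.exp(-digital / 50.0))
    return baseline * tv_lift * digital_lift

tv, digital = 100.0, 100.0
actual = predicted_sales(tv, digital)

# One-at-a-time zero-out contributions
tv_contribution = actual - predicted_sales(0.0, digital)
digital_contribution = actual - predicted_sales(tv, 0.0)

# Joint counterfactual: all media dark at once
joint_contribution = actual - predicted_sales(0.0, 0.0)

gap = (tv_contribution + digital_contribution) - joint_contribution
print(f"sum of zero-out contributions: {tv_contribution + digital_contribution:.2f}")
print(f"all-media contribution:        {joint_contribution:.2f}")
print(f"gap:                           {gap:.2f}")
```

In this sketch the one-at-a-time contributions overlap, so their sum exceeds the all-media contribution. If the same lifts were added rather than multiplied, the two totals would coincide, which is why the convention in use should always be stated next to the numbers.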

Specification Shopping and Causal Inference

Specification shopping—trying multiple model specifications and selecting the one that gives "reasonable" results—is incompatible with causal inference.

Why Specification Shopping Breaks Causal Inference

Causal identification relies on a priori specification of the causal graph and functional forms. When you:

Remove channels with "wrong sign" coefficients
Add controls until results "look right"
Try different functional forms until ROIs are "reasonable"

...you invalidate all statistical inference. The reported uncertainty no longer reflects true uncertainty, and "causal" estimates become meaningless.

The Correct Approach

Specify the DAG first — Based on domain knowledge, not data
Identify adjustment sets — Use backdoor/frontdoor criteria
Specify priors — Encode beliefs before seeing results
Fit once and report — Accept wide uncertainty if that's what the data support
Iterate scientifically — Model changes must be justified by theory, not by making results look better

The Preregistration Mindset

Treat your analysis plan like a preregistered experiment. Specify the model before looking at coefficient estimates. Document changes and their justifications. Report all models tried, not just the final one.

Key Takeaways

Draw the DAG First

Before any analysis, explicitly draw the assumed causal structure. This clarifies assumptions and identifies what needs to be controlled.

Intervention ≠ Observation

$P(Y|X)$ and $P(Y|do(X))$ are different quantities. Never confuse conditional probabilities with causal effects.

Counterfactuals Need Models

Counterfactual questions require structural models, not just data. The functional form matters—encode it thoughtfully.

From Theory to Practice in This Framework

Everything on this page has a concrete counterpart in how the framework works. The DAG discipline becomes explicit variable roles—each control is declared a confounder, a precision control, or a mediator, so the model adjusts for forks without blocking chains (see Variable Selection). The preregistration mindset becomes pre-specification: the model design is locked before results are seen, closing the door on specification shopping. And because observational data alone cannot always climb the ladder, experiment calibration folds real interventions—geo lift tests, matched-market designs—back into the model as priors or in-graph likelihoods (see Measurement & Calibration). For a hands-on tour of these features working together, see the Causal Features Showcase.

References

Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Pearl, J., Glymour, M., & Jewell, N. P. (2016). Causal Inference in Statistics: A Primer. Wiley.
Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press. [Free online]
Hernán, M. A., & Robins, J. M. (2020). Causal Inference: What If. Chapman & Hall/CRC. [Free online]
Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference. MIT Press.