DAGs and Causal Identification

Summary

Directed Acyclic Graphs (DAGs) are the primary tool for representing, visualizing, and reasoning about causal structures in observational data. They encode causal assumptions, identify confounders, and guide which variables to condition on to isolate treatment effects. Key concepts include forks, chains, colliders, d-separation, and the backdoor adjustment formula.

Overview

Causal inference cannot be established from data alone — data must be supplemented by a causal model that encodes assumptions about the direction and nature of relationships between variables. DAGs provide this model in a compact, visual, and mathematically precise way.

Causal Inference

Causal inference is the process of drawing conclusions about cause-and-effect relationships between variables while accounting for confounding and other sources of bias.

Core vocabulary:

  • Treatment (X): The action or intervention whose effect we want to measure (the “independent variable”).
  • Outcome (Y): The variable we want to understand (the “dependent variable”).
  • Confounder: A variable that causally affects both treatment and outcome, creating spurious association between them.

Basic DAG Structure

A DAG consists of nodes (variables) connected by directed edges (arrows indicating causal direction). The graph is acyclic — no variable can cause itself through a chain of relationships.
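One way to make the acyclicity requirement concrete is to store a graph as a mapping from each node to its parents and attempt a topological sort. The sketch below uses Python's standard-library graphlib; the helper name is_acyclic and the two small example graphs are illustrative assumptions, not part of the original notes.

```python
from graphlib import TopologicalSorter, CycleError


def is_acyclic(parents):
    """Return True if the graph, given as {node: set of parents}, has no directed cycle."""
    try:
        # static_order() raises CycleError when the graph contains a directed cycle
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


# Hypothetical example: Z -> X, Z -> Y, X -> Y
dag = {"Z": set(), "X": {"Z"}, "Y": {"Z", "X"}}
# Adding Y -> Z creates the cycle Z -> X -> Y -> Z
cyclic = {"Z": {"Y"}, "X": {"Z"}, "Y": {"Z", "X"}}

print(is_acyclic(dag))     # True
print(is_acyclic(cyclic))  # False
```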

Path in a DAG

A path is any sequence of adjacent arrows connecting two nodes (here, the treatment and the outcome), regardless of arrow direction. For example, in a DAG containing X ← Z → Y, the path X ← Z → Y is valid even though its arrows point in different directions.

Confounding and De-confounding

Confounders

A confounder Z is a variable that causes both treatment X and outcome Y: X ← Z → Y. The confounder creates a spurious association between X and Y, so estimating the relationship without adjusting for Z will yield a biased estimate of the causal effect.

Approaches to de-confounding:

| Approach | When applicable | Mechanism |
|---|---|---|
| Randomized Controlled Trial (RCT) | Prospective study | Random assignment breaks the confounder → treatment link |
| Stratification | Observational, discrete confounders | Compute the effect within strata of the confounder, then average |
| Conditioning / Backdoor adjustment | Observational, any measured confounders | Mathematical adjustment using Pearl’s backdoor formula |
| Controlling | Real-world trial | Hold the confounder fixed experimentally |

Conditioning vs. Controlling

  • Conditioning: Achieves the effect of an RCT mathematically on historical data using backdoor adjustment.
  • Controlling: Literally holds a variable constant in a real-world trial.
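The difference between observing, randomizing, and conditioning can be illustrated with a small simulation. This is a sketch under assumed numbers: a binary confounder z, a binary treatment x, a binary outcome y, and a true effect of 0.2 are all invented for illustration.

```python
import random

rng = random.Random(3)
TRUE_EFFECT = 0.2  # assumed causal effect of treatment on P(y = 1)


def outcome(x, z):
    # y depends on both the treatment x and the confounder z
    return rng.random() < 0.1 + TRUE_EFFECT * x + 0.4 * z


def diff_in_means(rows):
    """Outcome rate among treated rows minus outcome rate among untreated rows."""
    treated = [y for x, z, y in rows if x == 1]
    control = [y for x, z, y in rows if x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


observational, randomized = [], []
for _ in range(50000):
    z = int(rng.random() < 0.5)
    # Observational data: z influences who gets treated (z -> x)
    x_obs = int(rng.random() < (0.8 if z else 0.2))
    observational.append((x_obs, z, outcome(x_obs, z)))
    # RCT: a coin flip assigns treatment, so z no longer influences x
    x_rct = int(rng.random() < 0.5)
    randomized.append((x_rct, z, outcome(x_rct, z)))

naive = diff_in_means(observational)
rct = diff_in_means(randomized)
# Conditioning: estimate the effect within each stratum of z, then weight by P(z)
stratified = sum(
    diff_in_means([r for r in observational if r[1] == z])
    * sum(1 for r in observational if r[1] == z) / len(observational)
    for z in (0, 1)
)

print(f"naive: {naive:.2f}  rct: {rct:.2f}  stratified: {stratified:.2f}")
```

Under these assumptions the naive observational estimate overstates the effect, while both the randomized estimate and the stratified estimate land close to the assumed true effect.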

Key insight from Pearl: Condition only on the variables needed to block backdoor paths, typically the confounders. Conditioning on more variables is not automatically safer and can introduce new biases (see: colliders below).

The Three Junction Patterns

Every node in the interior of a path (not treatment or outcome endpoints) participates in exactly one of three junction patterns:

Fork

X ← Z → Y: Z is a common cause of X and Y, that is, a confounder. Conditioning on Z blocks the path; leaving Z unconditioned leaves the path open.

Shoe Size and Reading Ability

Age (Z) causes both shoe size (X) and reading ability (Y): X ← Z → Y.

  • Unconditioned: Strong spurious correlation between X and Y (larger shoes appear to go with better reading, but only because both increase with age).
  • Conditioned on age: Flat regression line, with no correlation between X and Y within a fixed age group. This is closely related to Simpson’s Paradox, in which an association appears, disappears, or reverses when data are aggregated or disaggregated.

Rule: Conditioning on the intermediate node in a fork blocks the path.
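The fork rule can be checked on simulated data. The sketch below assumes a toy data-generating process in which age drives both shoe size and reading score and nothing else links them; the coefficients, noise levels, and sample size are invented for illustration.

```python
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


rng = random.Random(0)
rows = []
for _ in range(20000):
    age = rng.randint(5, 12)               # common cause: age in years
    shoe = age + rng.gauss(0, 1)           # shoe size depends only on age
    reading = 10 * age + rng.gauss(0, 10)  # reading score depends only on age
    rows.append((age, shoe, reading))

overall = corr([r[1] for r in rows], [r[2] for r in rows])
# Condition on age by looking within a single age group
age8 = [r for r in rows if r[0] == 8]
within = corr([r[1] for r in age8], [r[2] for r in age8])

print(f"all ages: {overall:.2f}  age 8 only: {within:.2f}")
```

With these assumptions the pooled correlation is strongly positive, while the correlation within a single age group is close to zero.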

Chain

X → Z → Y: Z is a mediator that transmits the causal effect of X on Y. Conditioning on Z blocks the path (kills the causal transmission); leaving Z unconditioned leaves the path open.

Drug → Blood Pressure → Recovery

X → Z → Y: the drug (X) affects recovery (Y) through blood pressure (Z).

  • Unconditioned: Strong correlation between X and Y.
  • Conditioned on Z: X and Y become uncorrelated because the mediating pathway is blocked.

Rule: Conditioning on the intermediate node in a chain blocks the path.
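A similar simulation illustrates the chain rule. This sketch assumes binary variables and invented probabilities: the drug changes the chance of low blood pressure, and recovery depends only on blood pressure.

```python
import random

rng = random.Random(1)
rows = []
for _ in range(40000):
    drug = rng.random() < 0.5                            # treatment
    low_bp = rng.random() < (0.8 if drug else 0.3)       # mediator: depends only on the drug
    recovered = rng.random() < (0.7 if low_bp else 0.2)  # outcome: depends only on the mediator
    rows.append((drug, low_bp, recovered))


def recovery_rate(subset):
    return sum(r[2] for r in subset) / len(subset)


def effect(subset):
    """Difference in recovery rate between treated and untreated rows."""
    treated = [r for r in subset if r[0]]
    untreated = [r for r in subset if not r[0]]
    return recovery_rate(treated) - recovery_rate(untreated)


overall_diff = effect(rows)
# Condition on the mediator by splitting on blood pressure
diff_low_bp = effect([r for r in rows if r[1]])
diff_high_bp = effect([r for r in rows if not r[1]])

print(f"overall: {overall_diff:.2f}  low BP: {diff_low_bp:.2f}  high BP: {diff_high_bp:.2f}")
```

Under these assumptions the drug shows a clear overall effect on recovery, but within each blood-pressure group the difference is close to zero, because the only route from drug to recovery has been blocked.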

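Note: Chains and forks imply the same pattern in the data (the two end variables are associated, and the association vanishes once the middle variable is conditioned on), so you cannot tell them apart from data alone, which is why DAGs are necessary.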

Collider

X → Z ← Y: Z is a collider, meaning two causal arrows collide at Z. This is the critical pattern where intuition fails:

  • Unconditioned: the path is blocked (no spurious association between X and Y).
  • Conditioned on Z: the path is unblocked, so conditioning on the collider creates a spurious association.

Sports Ability, Academic Ability, and Bursaries

X → Z ← Y: both sporting ability (X) and academic ability (Y) cause bursary awards (Z).

  • Unconditioned: No correlation between X and Y (the two abilities are assumed to be independent in the population).
  • Conditioned on Z: At a fixed bursary score, students with low sports ability tend to have high academic ability and vice versa, so a negative correlation is induced.

Rule: Conditioning on the intermediate node in a collider unblocks the path.
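The collider effect can also be reproduced by simulation. This sketch assumes two independent standard-normal abilities and a bursary awarded when their sum exceeds a threshold; conditioning here means restricting to bursary holders rather than fixing an exact score, and all numbers are invented for illustration.

```python
import random


def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


rng = random.Random(2)
rows = []
for _ in range(20000):
    sport = rng.gauss(0, 1)           # sporting ability
    academic = rng.gauss(0, 1)        # academic ability, independent of sporting ability
    bursary = sport + academic > 1.5  # collider: awarded when combined ability is high
    rows.append((sport, academic, bursary))

overall = corr([r[0] for r in rows], [r[1] for r in rows])
# Condition on the collider by keeping only bursary holders
awarded = [r for r in rows if r[2]]
among_awarded = corr([r[0] for r in awarded], [r[1] for r in awarded])

print(f"everyone: {overall:.2f}  bursary holders: {among_awarded:.2f}")
```

In the full sample the two abilities are essentially uncorrelated; among bursary holders a clearly negative correlation appears, even though neither ability influences the other.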

Summary of Conditioning Rules

| Junction | Unconditioned | Conditioned |
|---|---|---|
| Fork (X ← Z → Y) | Open (spurious) | Blocked |
| Chain (X → Z → Y) | Open | Blocked |
| Collider (X → Z ← Y) | Blocked | Open (spurious!) |
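The three rules are mechanical enough to encode directly. The sketch below classifies a three-node junction from a set of directed edges and reports whether it is open; the function names classify and is_open and the example graph are hypothetical.

```python
def classify(edges, a, m, c):
    """Classify the junction a - m - c, where edges is a set of (parent, child) pairs."""
    if (a, m) in edges and (c, m) in edges:
        return "collider"
    if (m, a) in edges and (m, c) in edges:
        return "fork"
    if ((a, m) in edges and (m, c) in edges) or ((c, m) in edges and (m, a) in edges):
        return "chain"
    raise ValueError("a, m, c are not connected as a - m - c")


def is_open(junction, conditioned):
    """Forks and chains are open unless conditioned on; colliders are the reverse."""
    return conditioned if junction == "collider" else not conditioned


# Hypothetical DAG: Z -> X, Z -> Y, X -> M, M -> Y, X -> C, Y -> C
edges = {("Z", "X"), ("Z", "Y"), ("X", "M"), ("M", "Y"), ("X", "C"), ("Y", "C")}

for middle in ("Z", "M", "C"):
    kind = classify(edges, "X", middle, "Y")
    print(middle, kind, is_open(kind, conditioned=False), is_open(kind, conditioned=True))
```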

Backdoor Paths and Backdoor Adjustment

Backdoor Path

A backdoor path from treatment X to outcome Y is any path that starts with an arrow pointing into X (i.e., X ← ...). These paths carry spurious association and must be blocked. A front-door path starts with an arrow out of X (i.e., X → ...). Front-door paths that follow the arrows all the way to Y carry the causal effect and must remain open.
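Listing the paths between treatment and outcome and labelling each by the direction of its first arrow can be automated. This is a minimal sketch; the helpers all_paths and path_kind and the four-node DAG are illustrative assumptions.

```python
def all_paths(edges, start, end):
    """Yield every simple path from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        node = path[-1]
        if node == end:
            yield path
            return
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in path:  # never revisit a node
                yield from walk(path + [nxt])

    yield from walk([start])


def path_kind(edges, path):
    """'backdoor' if the first edge points into the start node, else 'front-door'."""
    return "backdoor" if (path[1], path[0]) in edges else "front-door"


# Hypothetical DAG: Z -> X, Z -> Y, X -> M, M -> Y
edges = {("Z", "X"), ("Z", "Y"), ("X", "M"), ("M", "Y")}
paths = [(path, path_kind(edges, path)) for path in all_paths(edges, "X", "Y")]

for path, kind in paths:
    print(" - ".join(path), kind)
```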

Backdoor Adjustment Formula

To estimate the causal effect of X on Y with confounder Z:

$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)$$

The do(·) operator represents an intervention (setting X to a value), not mere observation. The right-hand side is expressed entirely in observable quantities.

Intuition: Stratify by the confounder Z, compute the effect within each stratum, then average over the population distribution of Z.
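A small numeric example makes the stratify-then-average recipe concrete. The probabilities below are invented for illustration: a binary confounder z, a binary treatment x, and a binary outcome y, with z raising both the chance of treatment and the chance of a good outcome.

```python
# Hypothetical distribution (all numbers are assumptions for illustration)
p_z = {0: 0.5, 1: 0.5}           # P(z)
p_x1_given_z = {0: 0.2, 1: 0.8}  # P(x = 1 | z)
p_y1_given_xz = {                # P(y = 1 | x, z), keyed by (x, z)
    (0, 0): 0.1, (1, 0): 0.3,
    (0, 1): 0.5, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]


def naive(x):
    """P(y = 1 | x): plain conditioning, which mixes in the influence of z."""
    num = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    den = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return num / den


def adjusted(x):
    """P(y = 1 | do(x)) by backdoor adjustment: sum over z of P(y = 1 | x, z) * P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


naive_effect = naive(1) - naive(0)
causal_effect = adjusted(1) - adjusted(0)

print(f"naive: {naive_effect:.2f}  adjusted: {causal_effect:.2f}")
```

With these numbers the naive contrast is 0.44, while the adjusted contrast is 0.20, which matches the within-stratum effect of 0.2 built into the table.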

Valid Adjustment Sets

Valid Adjustment Set

A set of nodes is a valid adjustment set if, when conditioned on, it:

  1. Blocks all backdoor paths from treatment to outcome, and
  2. Leaves the causal (directed) paths from treatment to outcome open, so it must not contain mediators or other descendants of the treatment.

Finding valid adjustment sets in practice: Use dagitty (R) or dowhy (Python):

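library(dagitty)
dag <- dagitty("dag { Z -> X; Z -> Y; X -> Y }")    # hypothetical example DAG
adjustmentSets(dag, exposure = "X", outcome = "Y")  # minimal valid sets by default; type = "all" lists every valid set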

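Among valid adjustment sets, the preferred choice is usually one with the minimum number of nodes. By default, adjustmentSets returns minimal sets, meaning sets from which no node can be removed without losing validity.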
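For small graphs, candidate adjustment sets can also be found by brute force. The sketch below is not how dagitty or dowhy work internally; it enumerates subsets of the non-descendants of the treatment and keeps those that block every backdoor path. All function names and the five-node DAG are hypothetical.

```python
from itertools import combinations


def descendants(edges, node):
    """All nodes reachable from node by following arrows."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def all_paths(edges, start, end, path=None):
    """Yield every simple path from start to end, ignoring arrow direction."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for a, b in sorted(edges):
        for here, nxt in ((a, b), (b, a)):
            if here == path[-1] and nxt not in path:
                yield from all_paths(edges, start, end, path + [nxt])


def path_blocked(edges, path, conditioned):
    """True if some junction on the path blocks it, given the conditioned set."""
    for a, m, c in zip(path, path[1:], path[2:]):
        if (a, m) in edges and (c, m) in edges:  # collider at m
            if m not in conditioned and not (descendants(edges, m) & conditioned):
                return True
        elif m in conditioned:                   # fork or chain at m
            return True
    return False


def backdoor_sets(edges, x, y):
    """All sets of non-descendants of x that block every backdoor path from x to y."""
    nodes = {n for edge in edges for n in edge}
    candidates = sorted(nodes - {x, y} - descendants(edges, x))
    backdoor = [p for p in all_paths(edges, x, y) if (p[1], p[0]) in edges]
    valid = []
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(path_blocked(edges, p, set(subset)) for p in backdoor):
                valid.append(set(subset))
    return valid


# Hypothetical DAG: A -> X, A -> B, B -> Y, X -> M, M -> Y
edges = {("A", "X"), ("A", "B"), ("B", "Y"), ("X", "M"), ("M", "Y")}
sets_found = backdoor_sets(edges, "X", "Y")
print(sets_found)
```

In this example the only backdoor path is X ← A → B → Y, so conditioning on A, on B, or on both blocks it, while the empty set does not. The mediator M is excluded from the candidates because it is a descendant of X.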

Worked Example

For the DAG with paths:

  1. (backdoor)
  2. (backdoor; is a collider here)
  3. (backdoor; is a fork)
  4. (backdoor)
  5. (front-door; is a collider — must condition on to open this path)

Valid adjustment sets: , , .

d-Separation and d-Connection

d-Separation

A path p is d-separated (blocked) by a conditioning set S if and only if:

  1. p contains a fork A ← M → C or a chain A → M → C where M is in S, or
  2. p contains a collider A → M ← C where M is not in S and no descendant of M is in S.

If neither condition applies, the path is d-connected (unblocked).

Practical note: The descendant clause in condition 2 is rare in practice but appears frequently in the technical literature. For d-separation purposes, conditioning on a descendant of a collider opens the path just as conditioning on the collider itself does.
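The blocking rules for a single path, including the descendant-of-collider case, translate into a short check. This is a sketch with hypothetical function names and a hypothetical four-node graph in which the collider C has a child D.

```python
def descendants(edges, node):
    """All nodes reachable from node by following arrows."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def path_blocked(edges, path, conditioned):
    """Return True if the path (a list of nodes) is d-separated by the conditioned set."""
    for a, m, c in zip(path, path[1:], path[2:]):
        is_collider = (a, m) in edges and (c, m) in edges
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on
            if m not in conditioned and not (descendants(edges, m) & conditioned):
                return True
        elif m in conditioned:
            # A fork or chain blocks when its middle node is conditioned on
            return True
    return False


# Hypothetical DAG: X -> C <- Y, C -> D
edges = {("X", "C"), ("Y", "C"), ("C", "D")}
path = ["X", "C", "Y"]

print(path_blocked(edges, path, set()))  # True: unconditioned collider blocks
print(path_blocked(edges, path, {"C"}))  # False: conditioning on the collider opens the path
print(path_blocked(edges, path, {"D"}))  # False: so does conditioning on its descendant
```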

Terminology Reference

| Term | Meaning |
|---|---|
| Exogenous variable | Has no incoming arrows: causes others but is not caused within the model |
| Endogenous variable | Has at least one incoming arrow: its value is explained within the model |
| Unobserved confounder | A confounder that is not measured; creates a backdoor path that cannot be blocked by adjustment |
| Unconditional dependence | An association between two variables carried by a path that is open without any conditioning |
| Conditional independence | Independence between two variables that holds once every path between them is blocked by conditioning on a set of nodes |
| Mediator | Middle node of a chain; transmits causal effect |
| Covariate | A variable that affects the outcome but is not of primary interest; may be added to improve precision but is not required for identification |
