What are some ways to uncover causal estimates from non-experimental data?

Summary

When randomization is impossible, causal estimates can be recovered through several “quasi-experimental” strategies, each exploiting a different source of exogenous variation or structural assumption. The core challenge, selection bias, is formalized in the potential outcomes framework as $E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$, the difference in untreated potential outcomes between treated and untreated units. The main families of solutions are: (1) selection-on-observables methods (CIA, DAG adjustment, IPW, doubly-robust), (2) panel/longitudinal designs (Differences-in-Differences, Fixed Effects), (3) discontinuity-based designs (RD), (4) instrumental variables, (5) comparative case methods (Synthetic Control), and (6) ML-based heterogeneous effect estimation (metalearners). Sensitivity analysis assesses how robust the conclusions are when the identifying assumptions cannot be verified.

Answer

The Core Problem: Selection Bias

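All non-experimental causal identification confronts the same fundamental obstacle. For each individual $i$ with treatment indicator $D_i \in \{0, 1\}$, we observe only one of the two potential outcomes $Y_{1i}$ (treated) and $Y_{0i}$ (untreated):

$$Y_i = D_i Y_{1i} + (1 - D_i) Y_{0i}$$

A naive comparison of treated vs. untreated groups conflates the average treatment effect on the treated (ATT) with selection bias, the systematic difference in baseline outcomes between groups:

$$E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_{1i} - Y_{0i} \mid D_i = 1]}_{\text{ATT}} + \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{selection bias}}$$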

Each strategy below eliminates or neutralizes this selection bias term through a different mechanism (The Selection Problem, MHE Ch. 2).
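
The decomposition can be checked numerically. The following is a minimal sketch on simulated data, not taken from the source notes: the variable `ability`, the effect size of 2.0, and the selection rule are all illustrative assumptions. Because both potential outcomes are generated, the ATT and the selection bias term can be computed directly, which is never possible with real data.

```python
import random
from statistics import fmean

def simulate(n=100_000, true_effect=2.0, seed=0):
    """Simulate potential outcomes where an unobserved trait drives selection."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0, 1)                # unobserved confounder (hypothetical)
        y0 = 1.0 + ability + rng.gauss(0, 1)     # outcome without treatment
        y1 = y0 + true_effect                    # outcome with treatment
        d = 1 if ability + rng.gauss(0, 1) > 0 else 0  # high-ability units select in
        rows.append((d, y0, y1))
    return rows

rows = simulate()
treated = [(y0, y1) for d, y0, y1 in rows if d == 1]
control = [(y0, y1) for d, y0, y1 in rows if d == 0]

# What an analyst observes: Y1 for treated units, Y0 for untreated units.
naive = fmean(y1 for _, y1 in treated) - fmean(y0 for y0, _ in control)

# Quantities available only because this is a simulation.
att = fmean(y1 - y0 for y0, y1 in treated)
selection_bias = fmean(y0 for y0, _ in treated) - fmean(y0 for y0, _ in control)

print(f"naive difference: {naive:.3f}")
print(f"ATT:              {att:.3f}")
print(f"selection bias:   {selection_bias:.3f}")
```

The naive difference equals the ATT plus the selection bias term by construction, and here it overstates the effect because units with higher baseline outcomes are more likely to be treated.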


Strategy 1: Conditional Independence Assumption (CIA) — Selection on Observables

CIA / Unconfoundedness

Conditional on observed covariates $X_i$, treatment is as good as randomly assigned: $\{Y_{0i}, Y_{1i}\} \perp D_i \mid X_i$. Also called ignorability (Potential Outcomes Framework) or unconfoundedness.

When it works: All confounders are observed and controlled for. Treatment selection is entirely driven by observable characteristics.

Implementations:

| Method | Mechanism | Note |
|---|---|---|
| Regression adjustment | Include $X_i$ as controls | Frequentist Causal Estimation |
| IPW | Re-weight units by the inverse propensity score, where $p(X_i) = P(D_i = 1 \mid X_i)$ | Frequentist Causal Estimation |
| Doubly-robust (DR) | Combines outcome model + IPW; consistent if either is correct | Frequentist Causal Estimation |
| Propensity score matching | Match on $p(X_i)$ to create balanced treatment/control groups | Frequentist Causal Estimation |
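
As a rough illustration of the IPW row, the sketch below simulates a single binary covariate that drives both treatment uptake and the outcome, then compares the naive difference in means with a Horvitz-Thompson IPW estimate. The data-generating process (propensities of 0.8 and 0.2, a true effect of 1.0) and all function names are assumptions made for this example. The propensity score is estimated as the treated share within each covariate stratum, which is only feasible when covariates are discrete and low-dimensional.

```python
import random
from collections import defaultdict
from statistics import fmean

def simulate(n=50_000, seed=1):
    """Binary covariate x drives both treatment uptake and the outcome."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p = 0.8 if x == 1 else 0.2                # true propensity (assumed)
        d = 1 if rng.random() < p else 0
        y = 1.0 * d + 3.0 * x + rng.gauss(0, 1)   # true effect of d is 1.0
        data.append((x, d, y))
    return data

def estimate_propensity(data):
    """Estimate P(D=1 | X=x) as the treated share within each stratum of x."""
    counts = defaultdict(lambda: [0, 0])          # x -> [n_treated, n_total]
    for x, d, _ in data:
        counts[x][0] += d
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}

def ipw_ate(data):
    """Horvitz-Thompson inverse-propensity-weighted estimate of the ATE."""
    p_hat = estimate_propensity(data)
    return fmean(
        d * y / p_hat[x] - (1 - d) * y / (1 - p_hat[x]) for x, d, y in data
    )

data = simulate()
naive = fmean(y for _, d, y in data if d == 1) - fmean(y for _, d, y in data if d == 0)
ipw = ipw_ate(data)

print(f"naive difference: {naive:.3f}")   # contaminated by x
print(f"IPW estimate:     {ipw:.3f}")     # close to the simulated effect of 1.0
```

The re-weighting only works here because the one confounder is observed; an unobserved driver of selection would leave the IPW estimate biased in the same way as the naive comparison.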

Bad Controls

Only condition on pre-treatment covariates. Variables affected by treatment (“bad controls”) can block part of the causal effect or open collider paths, introducing bias rather than removing it (Conditional Independence Assumption, ^def-collider).


Strategy 2: Causal Diagrams (DAGs) + Backdoor Adjustment

DAGs formalize which variables to condition on using graphical rules (Directed Acyclic Graphs).

Backdoor Adjustment Formula ( ^thm-backdoor)

If a set $X$ blocks all backdoor paths from treatment $D$ to outcome $Y$ without opening collider paths or blocking front-door paths:

$$P(Y \mid do(D = d)) = \sum_{x} P(Y \mid D = d, X = x)\, P(X = x)$$

The three junction types determine what to condition on:

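  • Fork ($D \leftarrow X \rightarrow Y$): condition on $X$ to block confounding
  • Chain ($D \rightarrow M \rightarrow Y$): do not condition on $M$ (blocks the causal path)
  • Collider ($D \rightarrow C \leftarrow Y$): do not condition on $C$ (opens a spurious path)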

A valid adjustment set blocks all backdoor paths while preserving front-door paths and avoiding collider activation. DAGs convert the CIA from a vague “control for confounders” to a precise graphical criterion.


Strategy 3: Instrumental Variables (IV)

When unobserved confounders make the CIA implausible, an instrument $Z_i$ provides exogenous variation in treatment (Instrumental Variables, MHE Ch. 4).

Two IV requirements:

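  1. Relevance: $\text{Cov}(Z_i, D_i) \neq 0$, i.e. the instrument $Z_i$ shifts treatment $D_i$
  2. Exclusion restriction: $\text{Cov}(Z_i, \eta_i) = 0$, where $\eta_i$ is the error term in the outcome equation, i.e. the instrument affects the outcome only through treatment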

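The IV estimand (Wald estimator), where the second equality holds for a binary instrument $Z_i$:

$$\beta_{IV} = \frac{\text{Cov}(Y_i, Z_i)}{\text{Cov}(D_i, Z_i)} = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}$$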

2SLS, the standard implementation: regress treatment on the instrument to get fitted values $\hat{D}_i$, then regress the outcome on $\hat{D}_i$.

IV Estimates LATE, Not ATE

With heterogeneous effects, and under the additional monotonicity (no-defiers) assumption, IV identifies the Local Average Treatment Effect (LATE): the causal effect only for compliers (those whose treatment status is changed by the instrument) (Local Average Treatment Effects). Because compliers need not resemble the full population, this limits external validity.

Classic instruments: Vietnam draft lottery (Angrist 1990), quarter of birth for schooling (Angrist & Krueger 1991), Maimonides’ Rule for class size (Angrist & Lavy 1999).


Strategy 4: Differences-in-Differences (DiD)

DiD uses panel data (units observed before and after treatment) to difference out time-invariant unobserved confounders (Differences-in-Differences, MHE Ch. 5).

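The DiD estimator:

$$\hat{\delta}_{DiD} = \left(\bar{Y}_{\text{treated, post}} - \bar{Y}_{\text{treated, pre}}\right) - \left(\bar{Y}_{\text{control, post}} - \bar{Y}_{\text{control, pre}}\right)$$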

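Underlying regression:

$$Y_{ist} = \gamma_s + \lambda_t + \delta D_{st} + \varepsilon_{ist}$$

where $\gamma_s$ are group fixed effects, $\lambda_t$ are time effects, $D_{st}$ indicates treated groups in post-treatment periods, and $\delta$ is the DiD treatment effect.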
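
In the two-group, two-period case the estimator reduces to arithmetic on four cell means. The sketch below uses made-up outcomes (all numbers are hypothetical) to show the calculation; with this 2x2 layout the result equals the coefficient on the treatment term in the underlying regression.

```python
from statistics import fmean

# Hypothetical outcomes, one row per unit-period: (group, period, outcome).
rows = [
    ("treated", "pre", 10.0), ("treated", "pre", 12.0),
    ("treated", "post", 15.0), ("treated", "post", 17.0),
    ("control", "pre", 8.0), ("control", "pre", 10.0),
    ("control", "post", 9.0), ("control", "post", 11.0),
]

def cell_mean(group, period):
    """Average outcome for one group in one period."""
    return fmean(y for g, p, y in rows if g == group and p == period)

def did_estimate():
    """Change in the treated group minus change in the control group."""
    treated_change = cell_mean("treated", "post") - cell_mean("treated", "pre")
    control_change = cell_mean("control", "post") - cell_mean("control", "pre")
    return treated_change - control_change

did = did_estimate()
print(did)  # 4.0: the treated group rose by 5, the control group by 1
```

The control group's change of 1 serves as the estimate of what would have happened to the treated group without treatment, which is exactly what the common trends assumption licenses.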

Common Trends Assumption

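The identifying assumption: $E[Y_{0ist} \mid s, t] = \gamma_s + \lambda_t$, i.e. untreated outcomes are the sum of a time-invariant group effect $\gamma_s$ and a time effect $\lambda_t$ common to all groups. Treatment and control groups may have different levels but must share the same trend in the absence of treatment. The assumption cannot be tested directly, but its plausibility can be assessed with pre-treatment data via parallel pre-trends plots.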

Key example: Card & Krueger (1994) compared NJ and PA before and after the NJ minimum wage increase and found no evidence that employment fell; the point estimate was slightly positive.

A Bayesian DiD formulation (Bayesian Difference in Differences) encodes parallel trends as a model constraint and returns a full posterior over the treatment effect, making uncertainty explicit.


Strategy 5: Regression Discontinuity (RD)

When treatment is assigned by a rule based on a running variable $x_i$ crossing a threshold $c$, comparing units just above and below the cutoff provides local causal estimates (Regression Discontinuity Designs, MHE Ch. 6).

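Sharp RD: treatment jumps deterministically at the cutoff $c$ of the running variable $x_i$, so $D_i = 1\{x_i \geq c\}$, and the effect is the jump in the conditional mean of the outcome at $c$:

$$\tau_{RD} = \lim_{x \downarrow c} E[Y_i \mid x_i = x] - \lim_{x \uparrow c} E[Y_i \mid x_i = x]$$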

Fuzzy RD: the probability of treatment jumps at the cutoff $c$. Estimated as IV with $1\{x_i \geq c\}$ as the instrument for $D_i$, yielding LATE at the cutoff (Local Average Treatment Effects).

Validity checks: Pre-treatment covariates should show no jump at the cutoff $c$; the density of the running variable should be smooth at $c$ (McCrary test for manipulation).

Classic examples: Incumbency advantage (Lee 2008, vote-share margin), class size effect (Maimonides’ Rule), scholarship eligibility cutoffs.


Strategy 6: Synthetic Control

For aggregate units (states, countries) where DiD has too few observations, synthetic control constructs a weighted combination of untreated units that mimics the treated unit’s pre-treatment trajectory (Synthetic Control).

Synthetic Control Estimator ( ^def-synth-estimator)

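$$\hat{Y}^{N}_{1t} = \sum_{j=2}^{J+1} w_j Y_{jt}$$

where unit 1 is the treated unit and units $j = 2, \dots, J+1$ form the donor pool. Weights $w_j \geq 0$, $\sum_{j} w_j = 1$, are chosen to minimize pre-treatment divergence between the treated unit and its synthetic version.

Treatment effect for post-treatment periods $t > T_0$:

$$\hat{\tau}_{1t} = Y_{1t} - \sum_{j=2}^{J+1} w_j Y_{jt}$$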

Inference uses permutation (placebo) tests — apply the same procedure to each donor unit and compare the resulting “effects” to the actual treated unit’s effect (^def-constrained-synth).
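
The weight-fitting step can be sketched with a brute-force search over the simplex. Everything in this example is hypothetical: the three donor series, the treated series (constructed as a 0.5/0.3/0.2 mix of the donors with a post-treatment shift of -3), and the helper names. A grid search is used only to keep the example dependency-free and is hard-coded for exactly three donors; real applications would use a constrained optimizer and typically match on covariates as well as pre-treatment outcomes.

```python
# Hypothetical outcome paths for three donor units (pre- and post-treatment).
donors_pre = {
    "A": [10.0, 11.0, 12.0, 13.0],
    "B": [20.0, 19.0, 21.0, 22.0],
    "C": [5.0, 7.0, 6.0, 9.0],
}
donors_post = {"A": [14.0, 15.0], "B": [23.0, 24.0], "C": [8.0, 10.0]}

# Hypothetical treated unit: tracks 0.5*A + 0.3*B + 0.2*C before treatment,
# then sits 3 units below that mix afterwards.
treated_pre = [12.0, 12.6, 13.5, 14.9]
treated_post = [12.5, 13.7]

def synthetic_path(weights, donors):
    """Weighted average of donor outcomes, period by period."""
    periods = len(next(iter(donors.values())))
    return [sum(weights[name] * donors[name][t] for name in donors)
            for t in range(periods)]

def fit_weights(treated, donors, steps=100):
    """Grid search over non-negative weights summing to one (three donors only)."""
    names = list(donors)
    best_weights, best_loss = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            candidate = dict(zip(names, (i / steps, j / steps, (steps - i - j) / steps)))
            path = synthetic_path(candidate, donors)
            loss = sum((obs - syn) ** 2 for obs, syn in zip(treated, path))
            if loss < best_loss:
                best_weights, best_loss = candidate, loss
    return best_weights

weights = fit_weights(treated_pre, donors_pre)
effects = [obs - syn
           for obs, syn in zip(treated_post, synthetic_path(weights, donors_post))]

print(weights)   # close to A=0.5, B=0.3, C=0.2
print(effects)   # close to -3 in each post-treatment period
```

A placebo test would repeat `fit_weights` with each donor in turn playing the treated unit and compare the resulting gaps with the one estimated for the actual treated unit.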

Classic example: California Proposition 99 cigarette tax, with an estimated effect of about −25 packs per capita in cigarette sales by 2000 and a p-value ≈ 0.029 from placebo tests.

Extensions (Synthetic Control Extensions): penalized SC for multiple treated units, bias-corrected SC, elastic net, matrix completion. The Generalized Synthetic Control Method (Generalized Synthetic Control Method) unifies DiD and SC via an interactive fixed effects (IFE) model, handling multiple treated units and time-varying confounders.


Strategy 7: ML-Based Heterogeneous Effect Estimation (Metalearners)

Under the CIA, machine learning methods can estimate Conditional Average Treatment Effects (CATE) flexibly (Metalearners for CATE, Künzel et al. 2019).

| Learner | Description | Best When |
|---|---|---|
| S-Learner | Single model with treatment as feature | Weak heterogeneity |
| T-Learner | Separate models per treatment arm | Balanced groups |
| X-Learner | Cross-imputes ITEs; then regresses on covariates | Unbalanced groups (large control, small treated) |

X-Learner Minimax Optimality ( ^thm-x-learner)

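For unbalanced designs with $m \gg n$ ($m$ control units, $n$ treated units), the X-learner achieves rate $\mathcal{O}(m^{-a_\mu} + n^{-a_\tau})$ vs. the T-learner’s $\mathcal{O}(m^{-a_\mu} + n^{-a_\mu})$, where $a_\mu$ and $a_\tau$ are the rate exponents attainable for the response functions and the CATE respectively. When the CATE is smoother than the response functions ($a_\tau > a_\mu$), the X-learner exploits the full data to estimate CATE at the faster rate.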


Strategy 8: Time-Series Causal Inference (BSTS / CausalImpact)

For interventions on a single time series, Bayesian Structural Time-Series (BSTS) models the pre-intervention trend and seasonality, then extrapolates the counterfactual (Bayesian Structural Time-Series Model, Brodersen et al. 2015). The treatment effect is the observed minus counterfactual path.

Key components: local linear trend, seasonality, spike-and-slab variable selection over control series, Gibbs+Kalman smoother MCMC. The posterior gives pointwise, cumulative, and running-average effect estimates.


Strategy 9: Sensitivity Analysis — Assessing Robustness

All observational strategies rely on untestable assumptions (unconfoundedness, exclusion restriction, common trends). Sensitivity analysis quantifies robustness to violations (Sensitivity Analysis in Observational Studies).

E-Value ( ^def-e-value)

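The minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both treatment and outcome to explain away the observed effect. The larger the E-value, the stronger the unmeasured confounding would have to be, and the more robust the conclusion.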
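
For an observed risk ratio $RR \geq 1$, the E-value for the point estimate has the closed form $RR + \sqrt{RR \times (RR - 1)}$; protective effects ($RR < 1$) are handled by first taking the reciprocal. A minimal sketch of that calculation follows; the function name `e_value` is chosen for this example, and it covers only the point estimate, not the confidence-interval bound that is usually reported alongside it.

```python
import math

def e_value(risk_ratio):
    """E-value for an observed risk ratio (point estimate only)."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects are mapped to the same scale via the reciprocal.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

print(round(e_value(2.0), 3))   # 3.414
print(e_value(1.0))             # 1.0: a null effect needs no confounding to explain
print(e_value(0.5) == e_value(2.0))  # True: symmetric around RR = 1
```

Read the first result as: an unmeasured confounder would need risk-ratio associations of about 3.4 with both treatment and outcome to fully account for an observed risk ratio of 2.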

Other approaches: Rosenbaum & Rubin (1983) hidden binary confounder model, copula-based sensitivity (Franks et al. 2020) using transparent parametrization of non-identified quantities.


Summary Map

| Strategy | Eliminates selection bias via | Key Assumption | Estimand |
|---|---|---|---|
| CIA / Regression | Conditioning on $X_i$ | All confounders observed | ATE / CATE |
| DAG Backdoor Adjustment | Valid adjustment set | Correct causal graph | ATE |
| IV / 2SLS | Exogenous $Z_i$ | Exclusion restriction | LATE (compliers) |
| DiD / Fixed Effects | Panel differencing | Common trends | ATT |
| Sharp RD | Threshold comparison | Continuity at cutoff | LATE at cutoff |
| Synthetic Control | Constructed counterfactual | Pre-trend match | ATT (single unit) |
| Metalearners (X-learner) | ML under CIA | Unconfoundedness + overlap | CATE |
| BSTS / CausalImpact | Time-series extrapolation | Stationary control series | ATT (time-series) |

Practical Implications

  • Start with DAGs: Before choosing a strategy, draw the causal graph. This dictates whether CIA is defensible, whether an instrument exists, and what to condition on.
  • CIA requires rich observables: If you suspect important unobserved confounders, CIA-based methods (regression, IPW, matching) will produce biased estimates — no amount of ML sophistication compensates.
  • IV trades ATE for LATE: IV gets around unobserved confounding but estimates a local effect for compliers only. Consider whether the complier subpopulation is the policy-relevant group.
  • DiD requires pre-treatment data: Always plot pre-trends to assess the common trends assumption; include leads in the event study specification.
  • Synthetic control is for aggregate units: It shines with a single treated state/country observed over many time periods; it is not appropriate for individual-level data.
  • Sensitivity analysis is not optional: Report E-values or conduct copula/Rosenbaum bounds analysis for observational conclusions. A finding that is non-robust to small confounding effects should not be presented as causal.

Source Notes

| Note | Relevance |
|---|---|
| The Selection Problem | Core framing: potential outcomes and selection bias decomposition |
| Conditional Independence Assumption | CIA / unconfoundedness and regression adjustment |
| Directed Acyclic Graphs | DAG-based identification, backdoor adjustment formula |
| Instrumental Variables | 2SLS, Wald estimator, exclusion restriction |
| Local Average Treatment Effects | LATE theorem, complier characterization |
| Differences-in-Differences | Fixed effects regression, common trends, Card & Krueger |
| Regression Discontinuity Designs | Sharp and fuzzy RD, validity checks |
| Synthetic Control | Convex weights, placebo inference, Proposition 99 |
| Generalized Synthetic Control Method | IFE model, multiple treated units, bootstrap inference |
| Frequentist Causal Estimation | IPW, doubly-robust estimator theorem |
| Sensitivity Analysis in Observational Studies | E-value, copula sensitivity, Rosenbaum bounds |
| X-Learner | Cross-imputation CATE estimation for unbalanced designs |
| Bayesian Difference in Differences | Bayesian DiD with full posterior over treatment effect |
| Bayesian Structural Time-Series Model | BSTS / CausalImpact for time-series interventions |
| Mostly Harmless Econometrics.pdf | Angrist & Pischke (2008), Chs. 2–6 |
| Li et al. - 2022 - Bayesian causal inference a critical review.pdf | Li, Ding & Mealli (2022): Bayesian CI framework |
| Künzel et al. - 2017 - Metalearners for estimating heterogeneous treatment effects using machine learning.pdf | Künzel et al. (2019), PNAS: metalearners |

Gaps

  • Regression Kink Designs (RKD): The vault covers RD designs but not the related RKD strategy (where the slope of the treatment function jumps at a threshold). No vault coverage — consider ingesting Card et al. (2015) or Nielsen et al. (2010).
  • Front-Door Criterion: DAGs note covers backdoor adjustment but the vault has limited coverage of Pearl’s front-door criterion for settings where all backdoor paths cannot be blocked. Consider ingesting Pearl (2009) Causality Ch. 3.
  • Staggered DiD / Callaway-Sant’Anna: The vault’s DiD note covers classical two-period DiD. Recent literature on heterogeneous-timing DiD (Callaway & Sant’Anna 2021, Goodman-Bacon 2021) is not yet ingested.
  • Causal Discovery / Structure Learning: The vault covers eliciting expert knowledge for DAGs (Knowledge Elicitation) but has limited coverage of automated causal discovery algorithms (PC, FCI, NOTEARS).

Follow-Up Questions

  • What is the LATE theorem, and when does an IV estimate the average treatment effect vs. only a local effect?
  • When does DiD fail, and how do staggered treatment timing designs address heterogeneity in DiD?
  • How does the Generalized Synthetic Control Method differ from classical synthetic control and DiD?
  • When should I use the X-learner vs. T-learner vs. S-learner for heterogeneous treatment effect estimation?
  • How do you build a causal DAG for a new research problem, and how do you choose the valid adjustment set?