Synthetic Control

Summary

Synthetic control is a causal inference method for settings with a single treated unit and aggregated panel data. Instead of finding one control unit, it builds a “synthetic” counterfactual as a convex combination (weighted average) of multiple untreated units, chosen to match the treated unit in the pre-treatment period. It has been called “the most important innovation in the policy evaluation literature in the last few years.” Inference uses a permutation test over placebo treatments, in the spirit of Fisher’s Exact Test.

Overview

The problem: Difference-in-Differences (DiD) relies on the parallel trends assumption and is typically applied to disaggregated data. When we only have aggregated (city-level, state-level) panel data on a single treated unit, DiD standard errors are undefined (a degrees-of-freedom issue), and there may be no single appropriate control unit.

The solution: Construct a synthetic control — a weighted combination of untreated units calibrated to match the treated unit’s pre-treatment trajectory. Then compare the treated unit’s post-treatment outcome to the synthetic control’s.

Formal Setup

Synthetic Control Estimator

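Suppose we have $J+1$ units. Unit 1 is treated; units $j = 2, \dots, J+1$ form the donor pool. We observe outcomes $Y_{jt}$ for $T$ time periods, with $T_0$ periods before treatment.

The treatment effect at time $t > T_0$ for the treated unit is:

$$\tau_{1t} = Y^I_{1t} - Y^N_{1t}$$

where $Y^I_{1t}$ is the outcome with the intervention and $Y^N_{1t}$ is the outcome without it. Since $Y^I_{1t}$ is observed but $Y^N_{1t}$ is not, we estimate:

$$\hat{Y}^N_{1t} = \sum_{j=2}^{J+1} w_j Y_{jt}$$

The weights $W = (w_2, \dots, w_{J+1})$ are chosen so the synthetic control matches the treated unit in the pre-treatment period.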
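
To make the estimator concrete, here is a minimal numeric sketch: the synthetic outcome is a weighted sum of donor outcomes, and the estimated effect is the gap between the treated unit and that sum. The toy numbers and the names `donors`, `treated`, and `weights` are hypothetical and are not taken from the Proposition 99 data; the weights are simply assumed rather than estimated.

```python
import numpy as np

# Hypothetical toy panel: rows are time periods, columns are donor units.
donors = np.array([
    [10.0, 20.0],   # t = 1 (pre-treatment)
    [12.0, 22.0],   # t = 2 (pre-treatment)
    [14.0, 24.0],   # t = 3 (post-treatment)
])
treated = np.array([15.0, 17.0, 16.0])  # observed outcomes of the treated unit
weights = np.array([0.5, 0.5])          # assumed weights, one per donor

synthetic = donors @ weights            # estimated untreated outcome per period
effect = treated - synthetic            # estimated treatment effect per period

print(synthetic)  # [15. 17. 19.]
print(effect)     # [ 0.  0. -3.]
```

With these weights the synthetic unit reproduces the treated unit exactly in the two pre-treatment periods, so the gap of -3 in the last period is read as the estimated effect.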

Method 1: OLS / Unconstrained Regression

Treat the problem as an “upside-down” linear regression: instead of predicting an outcome from variables, we predict the treated unit from other units.

Setup: Pivot the data so each unit is a column and each (time period, feature) combination is a row. Let $y$ = treated unit’s values, $X$ = donor pool matrix.

Fit OLS to get weights:

from sklearn.linear_model import LinearRegression
 
weights_lr = LinearRegression(fit_intercept=False).fit(X, y).coef_

Problem: In the Proposition 99 example below, the donor pool has 38 states, so OLS has 38 free parameters. That is enough to fit the pre-treatment period perfectly (overfitting) and to extrapolate wildly post-treatment. Negative and large positive weights create implausible “synthetic” units outside the range of the data.
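
The extrapolation problem can be seen in a tiny example. The sketch below uses hypothetical numbers (two donors, two pre-treatment periods) and `np.linalg.lstsq`, which solves the same no-intercept least squares problem as `LinearRegression(fit_intercept=False)`. The names `X_pre`, `y_pre`, and `X_post` are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy data: rows are pre-treatment periods, columns are donors.
X_pre = np.array([[10.0, 20.0],
                  [12.0, 21.0]])
y_pre = np.array([5.0, 8.0])   # treated unit sits below both donors

# Unconstrained least squares without an intercept.
w_ols, *_ = np.linalg.lstsq(X_pre, y_pre, rcond=None)

# As many weights as rows, so the pre-treatment fit is exact.
pre_fit_error = np.abs(y_pre - X_pre @ w_ols).max()

# Apply the same weights to donor outcomes in a later period.
X_post = np.array([14.0, 40.0])
synthetic_post = X_post @ w_ols

print(w_ols)           # approximately [ 1.833 -0.667]
print(synthetic_post)  # approximately -1.0
```

One weight is negative and the other exceeds 1, and the post-treatment prediction of about -1 lies below every donor value, which would be meaningless for a quantity such as cigarette sales.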

Method 2: Constrained Optimization (Convex Combination)

Constrained Synthetic Control

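The canonical synthetic control restricts the weights to form a convex combination:

$$w_j \ge 0 \quad \text{for } j = 2, \dots, J+1, \qquad \sum_{j=2}^{J+1} w_j = 1$$

The optimal weights minimize:

$$\|X_1 - X_0 W\| = \left( \sum_{h=1}^{k} v_h \left( X_{h1} - \sum_{j=2}^{J+1} w_j X_{hj} \right)^2 \right)^{1/2}$$

subject to $w_j \ge 0$, $\sum_{j=2}^{J+1} w_j = 1$, where $X_1$ holds the treated unit’s $k$ pre-treatment predictors, $X_0$ holds the same predictors for the donors, and the $v_h$ reflect the importance of each predictor variable.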

Why convex? This restricts the synthetic control to interpolation (within the convex hull of control units) rather than extrapolation, producing more credible counterfactuals.
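
As a rough illustration of the objective, the sketch below evaluates the predictor-weighted distance for a given weight vector and checks whether that vector is a convex combination. The function names `weighted_loss` and `is_convex`, and all numbers, are hypothetical; this only evaluates the loss and does not search for the optimal weights.

```python
import numpy as np

def weighted_loss(w, X0, x1, v):
    """sqrt(sum_h v_h * (x1_h - sum_j w_j * X0_hj)^2)."""
    gap = x1 - X0 @ w
    return float(np.sqrt(np.sum(v * gap ** 2)))

def is_convex(w, tol=1e-9):
    """True if all weights are non-negative and they sum to 1."""
    return bool(np.all(w >= -tol) and abs(np.sum(w) - 1.0) <= tol)

# Rows are predictors, columns are donors (hypothetical values).
X0 = np.array([[100.0, 140.0],
               [0.40, 0.60]])
x1 = np.array([120.0, 0.50])    # treated unit's predictors
v = np.array([1.0, 1.0])        # equal predictor importance

w_mid = np.array([0.5, 0.5])    # treated unit is the donors' midpoint
w_one = np.array([1.0, 0.0])    # all weight on the first donor

print(is_convex(w_mid), weighted_loss(w_mid, X0, x1, v))
print(is_convex(w_one), weighted_loss(w_one, X0, x1, v))
print(is_convex(np.array([1.5, -0.5])))  # False: a negative weight
```

Changing `v` changes which predictor gaps dominate the loss; for example, a large `v` on the second row would penalize the 0.10 price gap under `w_one` much more heavily.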

California Cigarette Taxation (Proposition 99)

Setup: California passed Proposition 99 in 1988 (25 cent/pack cigarette tax). We want to estimate the effect on cigarette sales. Data: 1970–2000, 39 US states. California is treated; 38 others form the donor pool.

Features: cigsale (per-capita cigarette sales in packs) and retprice (retail price).

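import numpy as np
from functools import partial
from scipy.optimize import fmin_slsqp

features = ["cigsale", "retprice"]  # the two predictors listed above

def loss_w(W, X, y) -> float:
    """RMSE between the treated unit and the weighted donor pool."""
    return np.sqrt(np.mean((y - X.dot(W))**2))

def get_w(X, y):
    """Find convex weights (non-negative, summing to 1) that minimize loss_w."""
    w_start = [1/X.shape[1]] * X.shape[1]  # start from uniform weights
    weights = fmin_slsqp(
        partial(loss_w, X=X, y=y),
        np.array(w_start),
        f_eqcons=lambda x: np.sum(x) - 1,  # weights sum to 1
        bounds=[(0.0, 1.0)] * len(w_start),  # each weight in [0, 1]
        disp=False
    )
    return weights

# `cigar` is the state-year panel with boolean columns
# `after_treatment` and `california`.
# Pivot data: columns = states, rows = (feature × year)
inverted = (cigar.query("~after_treatment")
            .pivot(index='state', columns="year")[features]
            .T)

y = inverted[3].values      # California = state 3
X = inverted.drop(columns=3).values  # donor pool
calif_weights = get_w(X, y)

# Synthetic California for all years (definition assumed here; it mirrors
# the `synthetic_control` function used in the inference section).
calif_synth = (cigar.query("~california")
               .pivot(index='year', columns="state")["cigsale"]
               .values.dot(calif_weights))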

Result: Only 5 states get non-zero weight (a sparse solution). The synthetic control closely tracks California before 1988 without overfitting. After 1988 the two diverge, and by 2000 the synthetic control is about 25 packs per capita per year higher than actual California.

Interpretation: Proposition 99 reduced cigarette consumption by approximately 25 packs per capita per year by 2000, and the effect grew over time.

Why the Convex Constraint Helps

The convex constraint prevents extrapolation: the treated unit is approximated by its projection onto the convex hull of the control units. This:

  • Produces smoother, more credible post-treatment trajectories
  • Creates sparse weights (many zeros) — only a few states matter
  • Does NOT achieve perfect pre-treatment fit (by design — not overfitting)
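
These properties show up even in a two-donor toy case, where the constrained problem has a closed form: solve for the unconstrained weight on the first donor, then clip it to [0, 1]. The sketch below is an illustration under that two-donor assumption only; `two_donor_convex_weights` and the numbers are hypothetical, and the general case needs a constrained optimizer.

```python
import numpy as np

def two_donor_convex_weights(d1, d2, y):
    """Convex weights (w, 1 - w) minimizing ||y - (w*d1 + (1-w)*d2)||.

    Assumes d1 and d2 differ. With one free weight the objective is a
    convex quadratic, so clipping the unconstrained optimum to [0, 1]
    gives the constrained optimum.
    """
    direction = d1 - d2
    w = np.dot(y - d2, direction) / np.dot(direction, direction)
    w = float(np.clip(w, 0.0, 1.0))
    return np.array([w, 1.0 - w])

# Hypothetical pre-treatment outcomes; the treated unit is below both donors.
d1 = np.array([10.0, 12.0])
d2 = np.array([20.0, 21.0])
y = np.array([5.0, 8.0])

w = two_donor_convex_weights(d1, d2, y)
pre_gap = y - (w[0] * d1 + w[1] * d2)           # imperfect pre-treatment fit
synthetic_post = w @ np.array([14.0, 40.0])     # stays within donor range

print(w)               # [1. 0.]
print(pre_gap)         # [-5. -4.]
print(synthetic_post)  # 14.0
```

All weight lands on the closest donor (sparsity), the pre-treatment fit is not perfect, and the post-treatment prediction stays between the donors' values of 14 and 40.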

Inference: Fisher’s Exact Test

Standard errors are not well-defined for a single treated unit. Instead, use permutation inference:

Fisher's Exact Test for Synthetic Control

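  1. For each control state $j$, pretend it is the treated unit and compute its synthetic control using the remaining states as the donor pool.
  2. Compute the placebo treatment effect for each state at a post-treatment period $t$: $\hat{\tau}_{jt} = Y_{jt} - \hat{Y}^N_{jt}$
  3. Compute the P-value: $PV = \frac{1}{N} \sum_{j} \mathbb{1}\{\hat{\tau}_{jt} < \hat{\tau}_{\text{Calif},t}\}$, the share of the $N$ estimated effects that are more negative than California’s.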

Intuition: If no state was actually treated, the estimated effect should be near zero for all states. If the California effect is extreme relative to these “placebo” effects, it is statistically significant.
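
The P-value step reduces to counting. Here is a minimal sketch with hypothetical placebo effects; `permutation_p_value` is an illustrative name, and the test is one-sided (more negative counts as more extreme), matching a setting where the treatment is expected to lower the outcome.

```python
import numpy as np

def permutation_p_value(treated_effect, placebo_effects):
    """Share of placebo effects that are more negative than the treated effect."""
    placebo_effects = np.asarray(placebo_effects, dtype=float)
    return float(np.mean(placebo_effects < treated_effect))

# Hypothetical placebo effects for 10 control units.
placebos = [-30.0, -10.0, -5.0, 0.0, 2.0, 4.0, 8.0, 12.0, 15.0, 20.0]
p = permutation_p_value(-24.0, placebos)

print(p)  # 0.1, since 1 of the 10 placebos is below -24
```

Note that the smallest attainable non-zero P-value is 1 divided by the number of units in the comparison, so a small donor pool limits how significant any result can be.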

Inference for Proposition 99

import numpy as np
import pandas as pd
from functools import partial
from joblib import Parallel, delayed
 
def synthetic_control(state: int, data: pd.DataFrame) -> pd.DataFrame:
    """Compute synthetic control for a given state and return outcome + synthetic."""
    inverted = (data.query("~after_treatment")
                .pivot(index='state', columns="year")[features].T)
    y = inverted[state].values
    X = inverted.drop(columns=state).values
    weights = get_w(X, y)
    synthetic = (data.query(f"~(state=={state})")
                 .pivot(index='year', columns="state")["cigsale"]
                 .values.dot(weights))
    return (data.query(f"state=={state}")[["state","year","cigsale","after_treatment"]]
            .assign(synthetic=synthetic))
 
# Parallel computation for all 39 states
control_pool = cigar["state"].unique()
synthetic_states = Parallel(n_jobs=8)(
    delayed(partial(synthetic_control, data=cigar))(state)
    for state in control_pool
)
 
# Pre-treatment MSE filter (remove poorly fitted states)
def pre_treatment_error(state):
    pre = state.query("~after_treatment")
    return ((pre["cigsale"] - pre["synthetic"]) ** 2).mean()
 
# P-value: proportion of placebo effects more extreme than California's
effects = [state.query("year==2000").iloc[0]["cigsale"]
           - state.query("year==2000").iloc[0]["synthetic"]
           for state in synthetic_states
           if pre_treatment_error(state) < 80]
 
calif_effect = cigar.query("california & year==2000").iloc[0]["cigsale"] - calif_synth[-1]
# calif_effect ≈ -24.83
 
p_value = np.mean(np.array(effects) < calif_effect)
# p_value ≈ 0.029 (1 in 35 placebos more extreme)

Conclusion: California’s estimated effect in 2000 is −24.83 packs per capita, and only 1 of the 35 effects that pass the pre-treatment fit filter is more negative. The resulting P-value ≈ 0.029 (1/35) is statistically significant at the 5% level.

Comparison: Synthetic Control vs. DiD

| Aspect | Difference-in-Differences | Synthetic Control |
|---|---|---|
| Number of treated units | Multiple (or few) | Ideally one |
| Data level | Disaggregated OK | Often aggregated |
| Control selection | Pre-specified | Data-driven weighted combination |
| Key assumption | Parallel trends | Pre-treatment fit |
| Inference | Standard errors | Fisher’s Exact Test (permutation) |
| Overfitting concern | Lower | High (use convex constraint) |

Key Ideas

  1. Single-unit treatment with aggregated panel data → standard DiD fails
  2. Synthetic control = weighted average of donor pool, weights chosen to match pre-treatment trajectory
  3. OLS → overfits; constrained optimization (convex combination) → sparser, more credible weights
  4. Inference via permutation: pretend each control unit was treated, build its synthetic control, see how extreme the true effect is relative to placebo effects
  5. Pre-treatment fit quality: remove units with high pre-treatment error before the permutation test
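
The filtering step in idea 5 can be sketched on its own. The example below uses hypothetical placebo runs and a hypothetical cutoff; `pre_treatment_mse`, `runs`, and `threshold` are illustrative names, and the right cutoff depends on the scale of the outcome.

```python
import numpy as np

def pre_treatment_mse(actual, synthetic, n_pre):
    """Mean squared gap over the first n_pre (pre-treatment) periods."""
    actual = np.asarray(actual, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    return float(np.mean((actual[:n_pre] - synthetic[:n_pre]) ** 2))

# Hypothetical placebo runs: unit -> (actual, synthetic).
# The first 3 periods are pre-treatment, the last one is post-treatment.
runs = {
    "A": ([10.0, 11.0, 12.0, 13.0], [10.0, 11.0, 12.0, 15.0]),
    "B": ([10.0, 11.0, 12.0, 13.0], [20.0, 21.0, 22.0, 14.0]),  # poor fit
    "C": ([30.0, 31.0, 32.0, 30.0], [31.0, 30.0, 32.0, 33.0]),
}
threshold = 5.0  # hypothetical cutoff on pre-treatment MSE

# Keep only well-fitted units and record their final-period effect.
kept = {
    unit: actual[-1] - synthetic[-1]
    for unit, (actual, synthetic) in runs.items()
    if pre_treatment_mse(actual, synthetic, n_pre=3) < threshold
}

print(kept)  # {'A': -2.0, 'C': -3.0}
```

Unit B is dropped because its synthetic control never tracked it before treatment, so its post-treatment gap says little about what a placebo effect should look like.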

Connections

See Also