Covariate Balance Diagnostics
Summary
Diagnosing match quality is “perhaps the most important step.” The goal is covariate balance: the matched treated and control groups should have similar empirical covariate distributions. Balance is assessed with numerical diagnostics (standardized difference in means, variance ratios) and graphical diagnostics (propensity-score distributions, QQ plots, before/after standardized-difference plots). The target is balance itself, not the p-value of a balance hypothesis test.
Overview
Since matching only equates the observed covariates, balance is the in-sample property we can actually verify. Because no single summary captures a multivariate distribution, run several types of balance checks (means, variances, interactions, squares, QQ plots) for a fuller picture. All balance metrics should be computed the same way the outcome analysis will be run — within subclasses then aggregated for subclassification, and using the IPTW / variable-ratio / full-matching weights if those will be used in the analysis.
Main Content
Standardized difference in means (standardized bias) ^def-smd
For each covariate $X$, the standardized difference in means is
$$\frac{\bar{X}_t - \bar{X}_c}{\sigma_t},$$
where $\bar{X}_t$ and $\bar{X}_c$ are the covariate means in the treated and control groups and $\sigma_t$ is the standard deviation in the full treated group (the same standardization is used before and after matching so the comparison is meaningful). It is like an effect size and is compared before vs. after matching. Compute it for each covariate and for two-way interactions and squares. For binary covariates, use the same formula or a simple difference in proportions.
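A minimal Python sketch of the standardized difference described above, using only the standard library. The toy ages, the function names, and the optional weight arguments are assumptions for illustration; the one point it is meant to show is that the denominator is computed once on the full treated group and reused after matching.

```python
from statistics import stdev

def weighted_mean(values, weights):
    """Weighted mean; with equal weights this is the ordinary mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def standardized_difference(treated, control, sd_full_treated,
                            w_treated=None, w_control=None):
    """(mean_treated - mean_control) / SD of the FULL treated group.

    Optional weights (e.g. IPTW or full-matching weights) are an
    illustrative extension; omit them for plain 1:1 matching.
    """
    w_treated = w_treated or [1.0] * len(treated)
    w_control = w_control or [1.0] * len(control)
    diff = weighted_mean(treated, w_treated) - weighted_mean(control, w_control)
    return diff / sd_full_treated

# Hypothetical toy data: one covariate (age), invented for this example.
age_treated = [30.0, 34.0, 38.0, 42.0]
age_control_all = [20.0, 22.0, 24.0, 26.0, 31.0, 35.0, 39.0, 43.0]
age_control_matched = [31.0, 35.0, 39.0, 43.0]  # hypothetical 1:1 matches

# Fixed denominator: SD in the full (unmatched) treated group.
sd_t = stdev(age_treated)

smd_before = standardized_difference(age_treated, age_control_all, sd_t)
smd_after = standardized_difference(age_treated, age_control_matched, sd_t)
print(f"before: {smd_before:.3f}  after: {smd_after:.3f}")
```

The same function can be applied to squared terms or products of two covariates by passing the transformed columns, with the denominator again taken from the transformed values in the full treated group.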
Rubin (2001)'s three balance measures ^def-rubin-three
A comprehensive view of balance:
- The standardized difference of means of the propensity score.
- The ratio of the variances of the propensity score in the treated and control groups.
- For each covariate, the ratio of the variances of the residuals orthogonal to the propensity score.
Rules of thumb for regression adjustment to be trustworthy: absolute standardized differences of means < 0.25, and variance ratios between 0.5 and 2.
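The three measures and the rules of thumb can be sketched as follows. This is an illustration under stated assumptions, not Rubin's exact procedure: the residuals are taken from a pooled simple linear regression of the covariate on the propensity score, the propensity-score difference is standardized by the treated-group SD to match the definition used in this note, and all data values are hypothetical.

```python
from statistics import mean, stdev, variance

def variance_ratio(treated, control):
    """Ratio of sample variances, treated over control."""
    return variance(treated) / variance(control)

def residuals_on_score(x, score):
    """Residuals of x after a simple linear regression on score (with intercept)."""
    mx, ms = mean(x), mean(score)
    sxx = sum((s - ms) ** 2 for s in score)
    slope = sum((s - ms) * (xi - mx) for s, xi in zip(score, x)) / sxx
    intercept = mx - slope * ms
    return [xi - (intercept + slope * s) for xi, s in zip(x, score)]

def rubin_measures(ps_t, ps_c, x_t, x_c):
    """Return the three balance measures for one covariate x."""
    # Assumption: regression is fit on the pooled treated + control sample.
    resid = residuals_on_score(x_t + x_c, ps_t + ps_c)
    r_t, r_c = resid[:len(x_t)], resid[len(x_t):]
    return {
        # Assumption: all treated units are retained, so stdev(ps_t)
        # is the SD in the full treated group.
        "smd_ps": (mean(ps_t) - mean(ps_c)) / stdev(ps_t),
        "var_ratio_ps": variance_ratio(ps_t, ps_c),
        "var_ratio_resid": variance_ratio(r_t, r_c),
    }

def passes_rules_of_thumb(m):
    """|SMD| < 0.25 and both variance ratios between 0.5 and 2."""
    return (abs(m["smd_ps"]) < 0.25
            and 0.5 <= m["var_ratio_ps"] <= 2.0
            and 0.5 <= m["var_ratio_resid"] <= 2.0)

# Hypothetical propensity scores and one covariate, for illustration only.
ps_t = [0.40, 0.50, 0.60, 0.70]
ps_c = [0.35, 0.50, 0.60, 0.65]
x_t = [30.0, 34.0, 38.0, 42.0]
x_c = [31.0, 35.0, 39.0, 43.0]

m = rubin_measures(ps_t, ps_c, x_t, x_c)
print(m, passes_rules_of_thumb(m))
```

In practice the third measure is computed for every covariate, so `rubin_measures` would be called once per covariate and each residual variance ratio checked against the same bounds.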
Do not use hypothesis tests / p-values as balance measures ^warn-balance-test
Hypothesis tests and p-values that incorporate sample size (e.g., t-tests) should not be used to assess balance, for two reasons:
- Balance is an in-sample property of the matched data — it makes no reference to a super-population, so a hypothesis test about a population is conceptually inappropriate.
- Tests conflate balance with power. As matching discards controls, sample size (and power) falls, so a balance test’s p-value can rise — appearing to show improved balance simply because of reduced power. A test should not be used in a stopping rule when matched samples have varying sizes. Report standardized differences and variance ratios instead.
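The power problem in the second point can be illustrated with a small calculation. The sketch below holds the imbalance fixed at a standardized difference of 0.3 and only shrinks the control group, as discarding controls would. It assumes unit variance in both groups and uses a large-sample z approximation rather than an exact t-test; the sample sizes are hypothetical.

```python
import math

def two_sided_p(smd, n_treated, n_control):
    """Approximate two-sided p-value for a mean difference of `smd` SD units.

    Assumes unit variance in both groups, so the standard error of the
    difference is sqrt(1/n_treated + 1/n_control). Normal approximation.
    """
    z = smd / math.sqrt(1.0 / n_treated + 1.0 / n_control)
    return math.erfc(abs(z) / math.sqrt(2.0))

smd = 0.3        # the imbalance itself never changes
n_treated = 50   # hypothetical treated-group size

# Fewer controls retained -> larger standard error -> larger p-value.
p_values = {n_c: two_sided_p(smd, n_treated, n_c) for n_c in (1000, 200, 50)}
for n_c, p in p_values.items():
    print(f"controls={n_c:>4}  SMD={smd}  p={p:.4f}")
```

With these numbers the p-value moves from below 0.05 to above 0.1 even though the standardized difference is identical in all three rows, which is why the p-value cannot serve as the balance measure.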
Graphical diagnostics ^def-graphical
- Distribution of propensity scores across unmatched/matched treated and control units — also assesses common support; for weighting/subclassification, plot dot sizes proportional to weights.
- Quantile-quantile (QQ) plots for continuous covariates — compare the quantiles of a variable in treated vs. control; identical distributions fall on the 45-degree line. Can also be done for squares and interactions (second moments).
- Before/after standardized-difference plot (one line per covariate) — a quick overview of whether balance improved on each covariate.
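The numbers behind two of these plots can be sketched without a plotting library. The example below computes QQ points (pairs of matching quantiles) and the rows of a before/after standardized-difference plot; rendering them is left to whatever plotting tool is in use. All data values and function names are hypothetical.

```python
from statistics import quantiles

def qq_points(treated, control, n=4):
    """(control_quantile, treated_quantile) pairs at n-1 shared cut points."""
    return list(zip(quantiles(control, n=n), quantiles(treated, n=n)))

def max_gap_from_diagonal(points):
    """Largest distance between paired quantiles; 0 means all points lie on the 45-degree line."""
    return max(abs(t - c) for c, t in points)

def balance_rows(before, after):
    """One row per covariate: (name, |SMD| before, |SMD| after, improved?)."""
    return [(k, abs(before[k]), abs(after[k]), abs(after[k]) < abs(before[k]))
            for k in before]

# Hypothetical toy covariate (age), for illustration only.
age_treated = [30.0, 34.0, 38.0, 42.0]
age_control_all = [20.0, 22.0, 24.0, 26.0, 31.0, 35.0, 39.0, 43.0]
age_control_matched = [31.0, 35.0, 39.0, 43.0]

gap_before = max_gap_from_diagonal(qq_points(age_treated, age_control_all))
gap_after = max_gap_from_diagonal(qq_points(age_treated, age_control_matched))
print(f"max QQ gap before={gap_before:.2f} after={gap_after:.2f}")

# Hypothetical standardized differences for three covariates.
smd_before = {"age": 1.16, "income": 0.40, "urban": 0.03}
smd_after = {"age": -0.19, "income": 0.10, "urban": 0.08}
rows = balance_rows(smd_before, smd_after)
for name, b, a, improved in rows:
    status = "improved" if improved else "worse"
    print(f"{name:<8} before={b:.2f} after={a:.2f} {status}")
```

Each row corresponds to one line in the before/after plot. A covariate flagged as worse, like the hypothetical `urban` here, is the case to inspect by hand rather than to reject automatically.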
Examples
Stuart and Green (2008), 1:1 nearest-neighbor matching on the propensity score. The propensity-score distribution plot shows matched treated and control units occupying the same range, with a good match for each treated unit, while many unmatched controls fall outside that range. A companion plot of absolute standardized differences for 10 covariates shows nearly all dropping below the 0.2 threshold after matching. A few covariates with small initial imbalance can worsen, because they barely enter the propensity model. This matters only if those covariates are strongly related to the outcome, in which case add Mahalanobis matching on them within calipers.
Connections
- Verifies that the balancing property described in Propensity Score and the Balancing Property holds in the realized sample.
- Step 3 of the workflow in Propensity Score Matching - Overview; feeds back into Matching Methods and Distance Measures when re-matching.
- The propensity-score distribution plot doubles as a check of Common Support and Overlap.
- Distinct from the unverifiable Conditional Independence Assumption (balance on observed covariates does not prove unconfoundedness); sensitivity analysis addresses the unobserved part — relevant to Omitted Variables Bias.