Covariate Balance Diagnostics

Summary

Diagnosing match quality is “perhaps the most important step.” The goal is covariate balance: the matched treated and control groups should have similar empirical covariate distributions, $\tilde{p}(X \mid T = 1) = \tilde{p}(X \mid T = 0)$. Numerical diagnostics (standardized difference in means, variance ratios) and graphical diagnostics (propensity-score distributions, QQ plots, before/after standardized-difference plots) are used together. Crucially, the target is balance itself, not the p-value of a balance hypothesis test.

Overview

Since matching only equates the observed covariates, balance is the in-sample property we can actually verify. Because no single summary captures a multivariate distribution, run several types of balance checks (means, variances, interactions, squares, QQ plots) for a fuller picture. All balance metrics should be computed the same way the outcome analysis will be run — within subclasses then aggregated for subclassification, and using the IPTW / variable-ratio / full-matching weights if those will be used in the analysis.
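As a rough illustration of computing balance "the same way the outcome analysis will be run", the sketch below compares an unweighted difference in covariate means with one that uses analysis weights. The function name `weighted_mean_difference`, the toy data, and the weights are all hypothetical and do not come from any particular package.

```python
import numpy as np

def weighted_mean_difference(x, treat, weights):
    """Difference in weighted covariate means, treated minus control."""
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat, dtype=bool)
    w = np.asarray(weights, dtype=float)
    mean_t = np.average(x[treat], weights=w[treat])
    mean_c = np.average(x[~treat], weights=w[~treat])
    return mean_t - mean_c

# Toy data (hypothetical): three treated units followed by four controls.
x = np.array([2.0, 3.0, 4.0, 0.0, 2.0, 3.0, 4.0])
treat = np.array([1, 1, 1, 0, 0, 0, 0], dtype=bool)

# Ignoring the analysis weights: every unit counts once.
raw_diff = weighted_mean_difference(x, treat, np.ones_like(x))

# Using hypothetical analysis weights that give the first control zero weight,
# as would happen if that control were left unmatched.
w = np.array([1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])
weighted_diff = weighted_mean_difference(x, treat, w)

print(raw_diff, weighted_diff)
```

In this toy case the unweighted comparison suggests imbalance while the weighted one does not, which is why the diagnostic should use the weights the analysis will use.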

Main Content

Standardized difference in means (standardized bias) ^def-smd

For each covariate, the standardized difference in means is
$SMD = \frac{\bar{X}_{t} - \bar{X}_{c}}{\sigma_{t}},$
where $σ_{t}$ is the standard deviation in the full treated group (the same standardization is used before and after matching so the comparison is meaningful). It is like an effect size and is compared before vs. after matching. Compute it for each covariate and for two-way interactions and squares. For binary covariates, use the same formula or a simple difference in proportions.
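A minimal sketch of this calculation follows, assuming a single continuous covariate and a 1:1 matched subset. The helper name `standardized_mean_difference`, the toy values, and the `keep` mask marking matched units are illustrative assumptions. The key point is that the denominator is computed once, from the full treated group, and reused after matching.

```python
import numpy as np

def standardized_mean_difference(x, treat, sd_ref):
    """(mean of treated - mean of control) / sd_ref for one covariate."""
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat, dtype=bool)
    return (x[treat].mean() - x[~treat].mean()) / sd_ref

# Toy data (hypothetical): three treated units followed by five controls.
x = np.array([2.0, 4.0, 6.0, 0.0, 1.0, 2.0, 4.0, 5.0])
treat = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)

# Standard deviation in the FULL treated group, fixed before matching.
sd_treated = x[treat].std(ddof=1)

smd_before = standardized_mean_difference(x, treat, sd_treated)

# Hypothetical matched sample: all treated units plus three retained controls.
keep = np.array([1, 1, 1, 0, 0, 1, 1, 1], dtype=bool)
smd_after = standardized_mean_difference(x[keep], treat[keep], sd_treated)

print(smd_before, smd_after)
```

The same function can be applied to squared covariates or to products of two covariates, with `sd_ref` taken from the corresponding transformed values in the full treated group.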

Rubin (2001)'s three balance measures ^def-rubin-three

A comprehensive view of balance:

The standardized difference of means of the propensity score.

The ratio of the variances of the propensity score in the treated and control groups.

For each covariate, the ratio of the variances of the residuals orthogonal to the propensity score.

Rules of thumb for regression adjustment to be trustworthy: absolute standardized differences of means < 0.25, and variance ratios between 0.5 and 2.
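The three measures and the rules of thumb can be sketched as follows. Several choices here are assumptions rather than part of the source: the propensity score `ps` is taken as given (it could equally be the linear propensity score), the residuals are obtained by an ordinary least-squares regression of each covariate on the score in the pooled sample, and the standardized difference uses the treated-group standard deviation. The function names are hypothetical.

```python
import numpy as np

def rubin_measures(X, treat, ps):
    """Return (SMD of ps, variance ratio of ps, residual variance ratios)."""
    X = np.asarray(X, dtype=float)
    treat = np.asarray(treat, dtype=bool)
    ps = np.asarray(ps, dtype=float)

    # 1. Standardized difference of means of the propensity score.
    smd_ps = (ps[treat].mean() - ps[~treat].mean()) / ps[treat].std(ddof=1)

    # 2. Ratio of propensity-score variances, treated over control.
    vr_ps = ps[treat].var(ddof=1) / ps[~treat].var(ddof=1)

    # 3. Per-covariate variance ratio of residuals after regressing each
    #    covariate on the propensity score (assumption: pooled OLS with intercept).
    design = np.column_stack([np.ones_like(ps), ps])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    resid = X - design @ coef
    vr_resid = resid[treat].var(axis=0, ddof=1) / resid[~treat].var(axis=0, ddof=1)
    return smd_ps, vr_ps, vr_resid

def within_rules_of_thumb(smd_ps, vr_ps, vr_resid):
    """Check |SMD| < 0.25 and all variance ratios between 0.5 and 2."""
    ratios = np.append(np.asarray(vr_resid, dtype=float), vr_ps)
    return bool(abs(smd_ps) < 0.25 and np.all((ratios >= 0.5) & (ratios <= 2.0)))

# Simulated data (hypothetical): two covariates, score depends on the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
ps = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))
treat = rng.random(200) < ps

smd_ps, vr_ps, vr_resid = rubin_measures(X, treat, ps)
print(smd_ps, vr_ps, vr_resid, within_rules_of_thumb(smd_ps, vr_ps, vr_resid))
```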

Do not use hypothesis tests / p-values as balance measures ^warn-balance-test

Hypothesis tests and p-values that incorporate sample size (e.g., t-tests) should not be used to assess balance, for two reasons:

Balance is an in-sample property of the matched data — it makes no reference to a super-population, so a hypothesis test about a population is conceptually inappropriate.

Tests conflate balance with power. As matching discards controls, sample size (and power) falls, so a balance test’s p-value can rise — appearing to show improved balance simply because of reduced power. A test should not be used in a stopping rule when matched samples have varying sizes. Report standardized differences and variance ratios instead.
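The second point can be made concrete with a small calculation. The sketch below holds the standardized difference fixed at 0.2 and varies only the sample size. It assumes equal group sizes, unit variance in both groups, and a normal approximation to the test statistic, all simplifications chosen for illustration. The function name is hypothetical.

```python
import math

def two_sample_pvalue(smd, n_per_group):
    """Two-sided p-value for a mean difference of `smd` standard deviations.

    Assumes two groups of equal size with unit variance and uses a normal
    approximation, so z = smd / sqrt(2 / n_per_group).
    """
    z = smd * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Identical imbalance, different sample sizes.
p_large = two_sample_pvalue(0.2, 1000)  # before discarding units
p_small = two_sample_pvalue(0.2, 50)    # after discarding units

print(p_large, p_small)
```

The imbalance is the same in both cases, yet only the larger sample yields a small p-value, so a rising p-value after matching need not reflect any improvement in balance.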

Graphical diagnostics ^def-graphical

Distribution of propensity scores across unmatched/matched treated and control units — also assesses common support; for weighting/subclassification, plot dot sizes proportional to weights.

Quantile-quantile (QQ) plots for continuous covariates — compare the quantiles of a variable in treated vs. control; identical distributions fall on the 45-degree line. Can also be done for squares and interactions (second moments).

Before/after standardized-difference plot (one line per covariate) — a quick overview of whether balance improved on each covariate.
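The points of a QQ plot can be computed directly from empirical quantiles, as in the sketch below. The function name `qq_points`, the grid of probabilities, and the toy samples are assumptions for illustration. The plotting step itself is left to whichever library is in use.

```python
import numpy as np

def qq_points(x_treated, x_control, n_quantiles=99):
    """Return (control quantiles, treated quantiles) on a common grid."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    q_control = np.quantile(np.asarray(x_control, dtype=float), probs)
    q_treated = np.quantile(np.asarray(x_treated, dtype=float), probs)
    return q_control, q_treated

# Toy samples (hypothetical): the treated sample is the control sample shifted by 1.
rng = np.random.default_rng(1)
control = rng.normal(size=500)
treated = control + 1.0

q_c, q_t = qq_points(treated, control)

# Vertical distance of each point from the 45-degree line.
deviation = q_t - q_c
print(deviation.min(), deviation.max())
```

Plotting `q_t` against `q_c` gives the QQ plot. Here every point sits one unit above the 45-degree line, reflecting the shift in location, whereas identical distributions would give deviations of zero.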

Examples

Stuart and Green (2008), 1:1 nearest-neighbor matching on the propensity score: the propensity-score distribution plot shows matched treated and control units occupying the same range, with a good match for each treated unit, while many unmatched controls fall outside that range. A companion plot of absolute standardized differences for 10 covariates shows nearly all dropping below the 0.2 threshold after matching. A few covariates with small initial imbalance can get worse, because they barely enter the propensity model. This matters only if those covariates are strongly related to the outcome, in which case add Mahalanobis matching on them within propensity-score calipers.

Connections

Verifies that the balancing property (see Propensity Score and the Balancing Property) holds in the realized sample.
Step 3 of the workflow in Propensity Score Matching - Overview; feeds back into Matching Methods and Distance Measures when re-matching.
The propensity-score distribution plot doubles as a check of Common Support and Overlap.
Distinct from the unverifiable Conditional Independence Assumption (balance on observed covariates does not prove unconfoundedness); sensitivity analysis addresses the unobserved part — relevant to Omitted Variables Bias.
