Matching Methods and Distance Measures

Summary

Matching has two ingredients: a distance measure defining how “close” two individuals are, and a matching structure that uses those distances to form matched groups. Stuart catalogs four affinely-invariant distances (exact, Mahalanobis, propensity score, linear propensity score), caliper variants that combine them (caliper, Mahalanobis-within-caliper), and a spectrum of structures (k:1 nearest neighbor, greedy vs. optimal, with/without replacement, subclassification, full matching, weighting) that differ in how many individuals remain and what weights they receive.

Overview

After choosing covariates, the distance $D_{ij}$ between individuals $i$ and $j$ quantifies their similarity. All four measures below are affinely invariant (matches are unchanged under affine transformations of the data). The matching structure then trades off bias vs. variance and determines which estimand (ATT vs. ATE) is supported. Methods can be thought of as assigning weights between 0 and 1: nearest-neighbor effectively gives weights of 0 or 1; subclassification, full matching, and weighting form a continuum (weighting is the limit of subclassification as the number of subclasses → ∞, with full matching in between).

Main Content

Distance measures ^def-distances

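Four affinely-invariant distances $D_{ij}$ between individuals $i$ and $j$, where $X_i$ is the covariate vector and $e_i$ the (estimated) propensity score of individual $i$:

  1. Exact: $D_{ij} = 0$ if $X_i = X_j$, and $D_{ij} = \infty$ if $X_i \neq X_j$.
  2. Mahalanobis: $D_{ij} = (X_i - X_j)^\top \Sigma^{-1} (X_i - X_j)$, where $\Sigma$ is the covariance matrix of $X$ in the full control group (for the ATT) or in the pooled groups (for the ATE).
  3. Propensity score: $D_{ij} = |e_i - e_j|$.
  4. Linear propensity score: $D_{ij} = |\operatorname{logit}(e_i) - \operatorname{logit}(e_j)|$.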

Exact and Mahalanobis distances break down when $X$ is high-dimensional (exact matching leaves many individuals unmatched; Mahalanobis treats all interactions as equally important and works best with fewer than 8 mostly continuous, normally distributed covariates). Coarsened exact matching (CEM) relaxes exactness by matching on binned ranges.
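The four distances above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, covariate vectors are plain lists, and the inverse covariance matrix `sigma_inv` is assumed to be computed elsewhere and passed in.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def exact_distance(x_i, x_j):
    """0 if the covariate vectors are identical, infinity otherwise."""
    return 0.0 if list(x_i) == list(x_j) else math.inf

def mahalanobis_distance(x_i, x_j, sigma_inv):
    """(x_i - x_j)' Sigma^{-1} (x_i - x_j); sigma_inv is supplied by the caller."""
    d = [a - b for a, b in zip(x_i, x_j)]
    n = len(d)
    return sum(d[r] * sigma_inv[r][c] * d[c] for r in range(n) for c in range(n))

def propensity_distance(e_i, e_j):
    """Absolute difference in propensity scores."""
    return abs(e_i - e_j)

def linear_propensity_distance(e_i, e_j):
    """Absolute difference in propensity scores on the logit scale."""
    return abs(logit(e_i) - logit(e_j))
```

As a usage note, `propensity_distance(0.01, 0.05)` and `propensity_distance(0.50, 0.54)` are both about 0.04, while `linear_propensity_distance` gives roughly 1.65 for the first pair and 0.16 for the second: the logit scale stretches differences between scores near 0 or 1.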

Caliper and Mahalanobis-within-caliper matching ^def-caliper

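A caliper forbids matches whose propensity scores differ by more than a width $c$ (typically 0.2-0.25 SD of the linear propensity score). Mahalanobis matching within propensity-score calipers combines both: match on the Mahalanobis distance of the “key covariates” $Z$, but only among pairs inside the caliper:

$$D_{ij} = \begin{cases} (Z_i - Z_j)^\top \Sigma^{-1} (Z_i - Z_j) & \text{if } |\operatorname{logit}(e_i) - \operatorname{logit}(e_j)| \le c \\ \infty & \text{otherwise} \end{cases}$$

Here $\Sigma$ is the covariance matrix of $Z$ and $e_i$ is the propensity score of individual $i$.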

This yields samples well matched on the propensity score and particularly well matched on the key continuous covariates (e.g., baseline test scores).
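A minimal sketch of the caliper rule follows, assuming the caliper width is set to 0.25 sample standard deviations of the logit propensity scores. The names `caliper_width` and `caliper_mahalanobis` are hypothetical, and `sigma_inv` (the inverse covariance of the key covariates) is assumed to be supplied by the caller.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def caliper_width(scores, multiplier=0.25):
    """Caliper as a multiple of the sample SD of the logit propensity scores."""
    logits = [logit(e) for e in scores]
    mean = sum(logits) / len(logits)
    var = sum((v - mean) ** 2 for v in logits) / (len(logits) - 1)
    return multiplier * math.sqrt(var)

def caliper_mahalanobis(z_i, z_j, e_i, e_j, sigma_inv, caliper):
    """Mahalanobis distance on key covariates, or infinity outside the caliper."""
    if abs(logit(e_i) - logit(e_j)) > caliper:
        return math.inf  # pair is forbidden: scores are too far apart
    d = [a - b for a, b in zip(z_i, z_j)]
    n = len(d)
    return sum(d[r] * sigma_inv[r][c] * d[c] for r in range(n) for c in range(n))
```

Returning infinity for forbidden pairs lets any downstream matching routine treat the caliper as just another distance, provided that routine skips infinite entries.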

k:1 nearest neighbor matching ^def-nn

For each treated individual $i$, select the $k$ control(s) with the smallest distance from $i$. 1:1 matching is simplest and nearly always estimates the ATT (discarding unmatched controls). The apparent loss of power from discarding controls is usually minimal: precision is driven largely by the smaller group size, and more similar groups reduce extrapolation. Ratio (k:1) matching uses more controls per treated unit, reducing variance but increasing bias (the 2nd and 3rd closest matches are farther away). Variable-ratio matching lets the ratio vary across treated units (related to full matching).
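The selection step can be sketched as a greedy k:1 match without replacement on a precomputed distance matrix. This is an illustrative sketch under stated assumptions: `dist[t][c]` is a hypothetical layout (row per treated unit, column per control), treated units are processed in row order, and pairs with infinite distance are treated as forbidden.

```python
import math

def greedy_knn_match(dist, k=1):
    """Greedy k:1 nearest-neighbor matching without replacement.

    dist[t][c] is the distance from treated unit t to control c.
    Returns {treated index: [matched control indices]}.
    A treated unit may receive fewer than k controls if too few remain.
    """
    available = set(range(len(dist[0]))) if dist else set()
    matches = {}
    for t, row in enumerate(dist):
        # Rank the remaining, permitted controls by distance (ties by index).
        ranked = sorted(
            (c for c in available if row[c] < math.inf),
            key=lambda c: (row[c], c),
        )
        chosen = ranked[:k]
        matches[t] = chosen
        available -= set(chosen)  # without replacement: used controls leave the pool
    return matches

dist = [
    [0.1, 0.4, 0.9, 0.6],
    [0.2, 0.3, 0.5, 0.7],
]
one_to_one = greedy_knn_match(dist, k=1)
two_to_one = greedy_knn_match(dist, k=2)
```

With `k=2` the second treated unit is left with its 3rd and 4th closest controls, which illustrates how larger ratios pull in more distant matches.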

Greedy vs. optimal matching ^def-optimal

Greedy nearest-neighbor matches treated units one at a time; the result can depend on order. Optimal matching minimizes a global distance across all matched sets, avoiding order-dependence. Gu and Rosenbaum (1993): optimal matching does not generally produce better-balanced groups than greedy, but does better at assigning controls to treated units — so greedy suffices for well-matched groups, optimal is preferable for well-matched pairs. Greedy performs poorly under intense competition for controls.
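A toy comparison makes the order-dependence concrete. The sketch below finds the optimal 1:1 assignment by brute force over permutations, which is only feasible for tiny examples; practical optimal matching relies on network-flow or assignment algorithms instead. All function names are hypothetical.

```python
from itertools import permutations

def greedy_pairs(dist, order=None):
    """Greedy 1:1 matching without replacement, processing treated units in `order`."""
    order = list(range(len(dist))) if order is None else order
    used, pairs = set(), []
    for t in order:
        c = min((c for c in range(len(dist[t])) if c not in used),
                key=lambda c: dist[t][c])
        used.add(c)
        pairs.append((t, c))
    return sorted(pairs)

def optimal_pairs(dist):
    """1:1 matching minimizing total distance (brute force, tiny inputs only)."""
    n_t, n_c = len(dist), len(dist[0])
    best = min(permutations(range(n_c), n_t),
               key=lambda p: sum(dist[t][c] for t, c in enumerate(p)))
    return list(enumerate(best))

def total_distance(dist, pairs):
    return sum(dist[t][c] for t, c in pairs)

# Both treated units prefer control 0; treated unit 1 has a poor fallback.
dist = [
    [1.0, 2.0],
    [1.5, 5.0],
]
greedy = greedy_pairs(dist)                    # treated 0 goes first
greedy_reversed = greedy_pairs(dist, [1, 0])   # treated 1 goes first
optimal = optimal_pairs(dist)
```

Here greedy matching in row order pairs treated unit 1 with its distant control, whereas the reversed order happens to coincide with the optimal assignment. This is the kind of competition for controls in which greedy matching does poorly.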

With vs. without replacement ^def-replacement

With replacement: a control can be matched to multiple treated units. This reduces bias (good controls are reused) and helps when few comparable controls exist, and the order of matching no longer matters. Inference is more complex, however, because matched controls are not independent: use frequency weights equal to the number of times each control is used, and monitor how often each control is reused. Without replacement: each control is used at most once.
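As a sketch of the bookkeeping, 1:1 matching with replacement reduces to taking each treated unit's nearest control independently and then counting reuse. The function names and the `dist[t][c]` layout are illustrative assumptions.

```python
from collections import Counter

def match_with_replacement(dist):
    """Nearest control for each treated unit; controls may be reused.

    dist[t][c] is the distance from treated unit t to control c.
    Returns a list whose t-th entry is the matched control index.
    """
    return [min(range(len(row)), key=lambda c: row[c]) for row in dist]

def frequency_weights(matches, n_controls):
    """Weight for each control = number of treated units it was matched to."""
    counts = Counter(matches)
    return [counts.get(c, 0) for c in range(n_controls)]

dist = [
    [0.10, 0.80, 0.90],
    [0.20, 0.70, 0.60],
    [0.30, 0.25, 0.90],
]
matches = match_with_replacement(dist)
weights = frequency_weights(matches, n_controls=3)
```

Because each treated unit is handled independently, reordering the rows of `dist` changes nothing but the order of `matches`. A weight much larger than the others in `weights` flags a heavily reused control.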

Subclassification, full matching, and weighting ^def-subclass-full-weight

These use all individuals (vs. discarding in nearest-neighbor).

  • Subclassification: form groups (e.g., quintiles) of the propensity-score distribution; estimate effects within subclasses and aggregate. 5-10 subclasses typically remove ≥ 90% of the bias due to the covariates included in the propensity score; the result estimates the ATE or ATT depending on the aggregation weights.
  • Full matching: an optimal, automatic form of subclassification — each matched set has ≥ 1 treated and ≥ 1 control; minimizes the average within-set treated-control distance. Estimates ATE or ATT.
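  • Weighting (IPTW): use the propensity score directly as inverse-probability-of-treatment weights, $w_i = \frac{T_i}{e_i} + \frac{1 - T_i}{1 - e_i}$, where $T_i$ is the treatment indicator and $e_i$ the (estimated) propensity score, weighting each group up to the full sample (Horvitz-Thompson logic) and thereby targeting the ATE. Weighting by the odds, $w_i = T_i + (1 - T_i)\frac{e_i}{1 - e_i}$, targets the ATT. Extreme weights (scores near 0/1) inflate variance; weight trimming caps weights, and doubly-robust methods combine weighting with outcome modeling.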
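The two weighting schemes in the last bullet can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes `t` is a 0/1 treatment indicator and `e` an estimated propensity score strictly between 0 and 1, and it uses a normalized (ratio-type) weighted difference in means as one possible estimator.

```python
def ate_weight(t, e):
    """Inverse-probability-of-treatment weight: 1/e if treated, 1/(1-e) if control."""
    return t / e + (1 - t) / (1 - e)

def att_weight(t, e):
    """Weighting by the odds: 1 if treated, e/(1-e) if control."""
    return t + (1 - t) * e / (1 - e)

def trim(weights, cap):
    """Weight trimming: cap every weight at `cap` to limit variance inflation."""
    return [min(w, cap) for w in weights]

def weighted_mean_difference(y, t, w):
    """Weighted mean outcome of the treated minus that of the controls."""
    num1 = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == 1)
    den1 = sum(wi for ti, wi in zip(t, w) if ti == 1)
    num0 = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == 0)
    den0 = sum(wi for ti, wi in zip(t, w) if ti == 0)
    return num1 / den1 - num0 / den0
```

For example, a treated unit with propensity score 0.2 gets ATE weight 5 (it stands in for five similar individuals in the full sample) but ATT weight 1, while a control with the same score gets ATE weight 1.25 and ATT weight 0.25.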

Examples

Hansen (2004), SAT-coaching study: the original treated/control groups differed by 1.1 SD on the propensity score, but full matching produced matched sets differing by only 0.01-0.02 SD — retaining all controls while achieving near-optimal balance.

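Matching weights: under 3:1 ratio matching, each of the 3 controls matched to a treated unit receives weight 1/3, while a control that is the sole match of a treated unit gets weight 1. Under 1:1 matching with replacement, a control matched to 3 different treated units instead receives frequency weight 3.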

Connections

See Also