Spatial Models — Besag-York-Mollie (BYM)

Summary

The BYM model is the standard Bayesian approach for areal spatial data (counties, census tracts, etc.). It decomposes spatial variation into a structured component (ICAR prior, capturing spatial autocorrelation) and an unstructured component (i.i.d. random effects), with a mixture parameter controlling the balance.

When to Use BYM

  • Data type: Areal (discrete spatial units with neighbourhood structure), not point-referenced data
  • Goal: Partition variance into spatially-structured vs. unstructured components; estimate predictor effects after accounting for spatial autocorrelation
  • Scale: Computationally efficient — cost grows nearly linearly in number of areas (unlike full Gaussian Process)
  • For continuous spatial data, use a Gaussian Process instead

The ICAR Prior

Intrinsic Conditional Autoregressive (ICAR) is the engine of BYM. Given adjacency matrix (1 if areas share a border, 0 otherwise), ICAR encodes:

Key properties:

  • Penalises differences between neighbouring areas (Tobler’s first law of geography: nearby things are more similar)
  • is constrained to sum to zero (zero-mean)
  • Improper prior — must be embedded in a larger model with identifiable parameters
  • In PyMC: pm.ICAR("phi", W=W)

Graph Theory Vocabulary

TermSpatial Meaning
NodeAn area (census tract, county)
EdgeShared border between two areas
Adjacency matrix if areas share a border
Edge listEquivalent compact representation of

The BYM2 Parameterisation

The modern BYM formulation (Riebler et al., 2016) uses a scaled, identified mixture:

ParameterMeaning
Intercept (re-centres the mixture)
Joint scale of random effects
Fraction of variance that is spatially structured
Unstructured random effects
ICAR spatial effects (standardised by scaling factor )
Scaling factor (computed from )

The scaling factor

is the geometric mean of the diagonal of the pseudo-inverse of the graph Laplacian . It normalises so that , making and directly comparable and interpretable as a joint scale.

Intuition: Densely connected graphs (every area adjacent to every other) imply little variance — the scaling factor corrects for this so can be interpreted consistently across different graph structures.

PyMC Model (NYC Traffic Accidents)

with pm.Model(coords=coords) as BYM_model:
    beta0   = pm.Normal("beta0", 0, 1)            # intercept
    beta1   = pm.Normal("beta1", 0, 1)            # fragmentation effect
    theta   = pm.Normal("theta", 0, 1, dims="area_idx")  # unstructured RE
    phi     = pm.ICAR("phi", W=W_nyc, dims="area_idx")  # structured RE
    sigma   = pm.HalfNormal("sigma", 1)
    rho     = pm.Beta("rho", 0.5, 0.5)
 
    mixture = pt.sqrt(1 - rho) * theta + pt.sqrt(rho / scaling_factor) * phi
    mu = pt.exp(log_E + beta0 + beta1 * fragment_index + sigma * mixture)
    y_i = pm.Poisson("y_i", mu, observed=y)
  • Offset log_E: log-population, makes mu an excess risk rate (not raw count)
  • Poisson likelihood: appropriate for count outcomes (traffic accidents, disease cases)

Variance Decomposition

After fitting, visualise each component separately:

VisualisationSetInterpretation
Spatial smoothing, What spatial structure alone explains
Predictor effectWhat fragmentation index alone explains
Unstructured residuals, Remaining unstructured noise

Spatial smoothing is useful for forecasting: low-accident tracts surrounded by high-accident neighbourhoods will likely regress toward their neighbours in the future (partial pooling in space).

Connections

See Also

Source