Marginal Likelihood via the Kalman Filter
Summary
Running the Kalman filter does more than estimate the state: it also evaluates the marginal likelihood $p(y_{1:T} \mid \theta)$ of the model parameters $\theta$ as a by-product, at no extra cost. The trick is the prediction-error decomposition: the joint density of the data factors into a product of one-step predictive densities, each a Gaussian in the filter’s innovation $v_k$ with covariance $S_k$. This turns an intractable high-dimensional integral over states into a simple recursive sum, which is precisely what lets you put a prior on $\theta$ and run MCMC, MAP optimization, or EM. It is the engine behind parameter inference in BSTS.
Overview
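To learn the parameters $\theta$ (the entries of the model matrices, $A$, $Q$, $H$, $R$ in Särkkä’s notation) we want the marginal posterior $p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta)$. The states must be integrated out:
$$p(y_{1:T} \mid \theta) = \int p(y_{1:T}, x_{0:T} \mid \theta)\, dx_{0:T}$$
Doing this integral directly is intractable, and its dimension grows with $T$. The state-space structure dissolves it.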
Main Content
Theorem: Prediction-error decomposition (Särkkä Thm. 12.1, Eq. 12.5)
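The marginal likelihood factors into one-step predictive densities:
$$p(y_{1:T} \mid \theta) = \prod_{k=1}^{T} p(y_k \mid y_{1:k-1}, \theta)$$
where each term is obtained by integrating the measurement model against the predicted state:
$$p(y_k \mid y_{1:k-1}, \theta) = \int p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)\, dx_k$$
The predictive $p(x_k \mid y_{1:k-1}, \theta)$ is exactly the Kalman prediction step, and the normalizer of the update step is this term, so the filter computes it at no extra cost.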
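The decomposition can be checked numerically on a case small enough to integrate by hand. The sketch below assumes a scalar random-walk-plus-noise model with made-up values for `m0`, `P0`, `Q`, `R` and two observations (none of these come from the source). It compares the sum of one-step predictive log densities from the filter with the log density of the joint Gaussian of $(y_1, y_2)$.

```python
import math

def log_normal(x, mean, var):
    """Log density of a scalar Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Assumed toy model: x_k = x_{k-1} + q_k, y_k = x_k + r_k,
# with q_k ~ N(0, Q), r_k ~ N(0, R), x_0 ~ N(m0, P0). Values are illustrative.
m0, P0, Q, R = 0.0, 1.0, 0.5, 0.3
y = [0.7, -0.2]

# Prediction-error decomposition: sum of one-step predictive log densities.
m, P, loglik_filter = m0, P0, 0.0
for yk in y:
    m_pred, P_pred = m, P + Q          # prediction step
    S = P_pred + R                     # innovation variance
    loglik_filter += log_normal(yk, m_pred, S)
    K = P_pred / S                     # Kalman gain
    m = m_pred + K * (yk - m_pred)     # update step
    P = (1.0 - K) * P_pred

# Direct route: (y_1, y_2) is jointly Gaussian with mean (m0, m0) and covariance
# [[P0+Q+R, P0+Q], [P0+Q, P0+2Q+R]], i.e. the states integrated out by hand.
a, b, d = P0 + Q + R, P0 + Q, P0 + 2.0 * Q + R
det = a * d - b * b
e1, e2 = y[0] - m0, y[1] - m0
quad = (d * e1 * e1 - 2.0 * b * e1 * e2 + a * e2 * e2) / det
loglik_direct = -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

print(loglik_filter, loglik_direct)    # the two values agree up to rounding
```

The direct route needs the full $T \times T$ covariance of the data, while the filter only ever handles one scalar Gaussian per step.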
Theorem: Energy function for the linear-Gaussian model (Särkkä Thm. 12.3, Eq. 12.38)
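Define the energy function (unnormalized negative log-posterior) $\varphi_T(\theta) = -\log p(y_{1:T} \mid \theta) - \log p(\theta)$. For the linear-Gaussian model it is built recursively alongside the Kalman filter:
$$\varphi_k(\theta) = \varphi_{k-1}(\theta) + \tfrac{1}{2} \log \lvert 2\pi S_k(\theta) \rvert + \tfrac{1}{2}\, v_k^\top(\theta)\, S_k^{-1}(\theta)\, v_k(\theta)$$
started from $\varphi_0(\theta) = -\log p(\theta)$, where $v_k$ (innovation) and $S_k$ (innovation covariance) come straight from the Kalman update step. Each one-step predictive density is the Gaussian
$$p(y_k \mid y_{1:k-1}, \theta) = \mathrm{N}\big(y_k \mid H(\theta)\, m_k^-(\theta),\; S_k(\theta)\big)$$
The marginal parameter posterior is then $p(\theta \mid y_{1:T}) \propto \exp(-\varphi_T(\theta))$.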
Why this makes inference tractable
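- The intractable integral is gone. The full log-likelihood is a recursive sum of cheap Gaussian terms: one Kalman pass evaluates $\varphi_T(\theta)$ exactly for any $\theta$.
- MAP / ML estimates: $\hat{\theta}^{\mathrm{MAP}} = \arg\min_\theta \varphi_T(\theta)$ (Eq. 12.13); the ML estimate is the same with a flat prior $p(\theta) \propto 1$.
- MCMC: Metropolis–Hastings only needs the unnormalized posterior, i.e. $\exp(-\varphi_T(\theta))$; the normalizing constant $p(y_{1:T})$ is never required. The acceptance ratio uses energy differences directly: $\alpha = \min\{1,\ \exp(\varphi_T(\theta) - \varphi_T(\theta^*))\, q(\theta \mid \theta^*) / q(\theta^* \mid \theta)\}$, where $\theta^*$ is a proposal drawn from $q(\theta^* \mid \theta)$. So a single Kalman-filter run per proposal gives a full MCMC sampler over model parameters.
- EM: Fisher’s identity expresses the gradient of the log marginal likelihood, $\nabla_\theta \log p(y_{1:T} \mid \theta)$, as an expectation of the complete-data score under the smoothing distribution; the RTS smoother supplies the required sufficient statistics (Särkkä §12.3, Eq. 12.32).
- Laplace approximation: $p(\theta \mid y_{1:T}) \approx \mathrm{N}\big(\theta \mid \hat{\theta}^{\mathrm{MAP}},\ [\nabla^2 \varphi_T(\hat{\theta}^{\mathrm{MAP}})]^{-1}\big)$, using the Hessian of $\varphi_T$ at the MAP estimate (Eq. 12.15).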
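The MCMC point can be made concrete with a small sampler that only ever calls an energy function. This is a minimal sketch, not a reference implementation: it assumes a symmetric Gaussian random-walk proposal (so the proposal ratio cancels), and the function name `metropolis` and the stand-in quadratic energy are hypothetical. In a real application `energy` would run one Kalman-filter pass and return $\varphi_T(\theta)$.

```python
import math
import random

def metropolis(energy, theta0, n_samples, step, seed=0):
    """Random-walk Metropolis sampler targeting exp(-energy(theta)).

    The proposal is symmetric, so the q-ratio cancels and the acceptance
    probability reduces to min(1, exp(energy(current) - energy(proposal))).
    """
    rng = random.Random(seed)
    theta, phi = theta0, energy(theta0)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        phi_prop = energy(proposal)      # in practice: one Kalman-filter pass
        # Accept with probability min(1, exp(phi - phi_prop)), done in log space.
        if math.log(1.0 - rng.random()) < phi - phi_prop:
            theta, phi = proposal, phi_prop
        samples.append(theta)
    return samples

# Stand-in energy (hypothetical): a standard normal target, phi = theta^2 / 2.
samples = metropolis(lambda th: 0.5 * th * th, theta0=3.0,
                     n_samples=20000, step=1.0)
kept = samples[2000:]                    # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)                         # should be near 0 and 1
```

Only energy differences enter the accept/reject test, so any additive constant in $\varphi_T$ (including the unknown $\log p(y_{1:T})$) drops out. An energy that returns `math.inf` for invalid parameters, such as a negative variance, is rejected automatically.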
Examples
Posterior of the noise variance in the random walk (Särkkä Ex. 12.1)
For the Gaussian random-walk model, fix the process variance $Q$ and treat the measurement variance $R$ as the unknown parameter, $\theta = R$. With a flat prior $p(R) = 1$, running the scalar Kalman filter and accumulating $\varphi_T(R)$ from Eq. (12.38) yields the posterior $p(R \mid y_{1:T}) \propto \exp(-\varphi_T(R))$. Särkkä’s Fig. 12.1 shows that the true $R$ sits in the high-density region, but the MAP/ML point estimate falls below the true value, illustrating why a full posterior (MCMC) treatment can be preferable to a point estimate.
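A rough way to reproduce the shape of such a posterior is to evaluate the energy on a grid of candidate variances. The sketch below uses simulated data, not Särkkä's: the true variance, the fixed process variance, the series length, the prior on the initial state, and the grid are all assumptions made for illustration.

```python
import math
import random

def energy(R, y, Q=1.0, m0=0.0, P0=1.0):
    """phi_T(R) for the scalar random walk under a flat prior p(R) = 1."""
    m, P, phi = m0, P0, 0.0               # phi_0 = -log p(R) = 0
    for yk in y:
        P_pred = P + Q                    # prediction (the mean is unchanged)
        v, S = yk - m, P_pred + R         # innovation and its variance
        phi += 0.5 * math.log(2.0 * math.pi * S) + 0.5 * v * v / S
        K = P_pred / S                    # Kalman gain
        m, P = m + K * v, (1.0 - K) * P_pred
    return phi

# Simulated data; the "true" values below are assumptions for this demo.
rng = random.Random(1)
R_true, Q, x, y = 0.5, 1.0, 0.0, []
for _ in range(200):
    x += rng.gauss(0.0, math.sqrt(Q))
    y.append(x + rng.gauss(0.0, math.sqrt(R_true)))

grid = [0.05 * i for i in range(1, 61)]   # candidate R values in (0, 3]
phis = [energy(R, y, Q=Q) for R in grid]
phi_min = min(phis)
weights = [math.exp(phi_min - p) for p in phis]   # subtract the min for stability
total = sum(weights)
posterior = [w / total for w in weights]          # normalized over the grid
R_map = grid[phis.index(phi_min)]                 # grid point with lowest energy
print(R_map)
```

Each grid point costs one filter pass, so the whole curve is 60 passes over 200 observations. The grid minimizer `R_map` is the MAP/ML estimate only up to grid resolution.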
Worked energy step (local level model)
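Per time step, with scalar innovation $v_k = y_k - m_k^-$ and innovation variance $S_k = P_k^- + R$, the energy increases by
$$-\log p(y_k \mid y_{1:k-1}, \theta) = \tfrac{1}{2} \log(2\pi S_k) + \frac{v_k^2}{2 S_k}$$
Summing over $k = 1, \dots, T$ and adding $-\log p(\theta)$ gives the negative log-posterior $\varphi_T(\theta)$ used inside an MCMC loop.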
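A single increment $\tfrac{1}{2}\log(2\pi S_k) + v_k^2 / (2 S_k)$ is easy to verify by hand. The numbers below are hypothetical, chosen only so that the innovation and its variance come out round.

```python
import math

# Hypothetical numbers for one filter step of a local level model.
m_pred, P_pred = 0.2, 1.5   # predicted state mean and variance
R = 0.5                     # measurement noise variance
y_k = 1.2                   # new observation

v_k = y_k - m_pred          # innovation: 1.0
S_k = P_pred + R            # innovation variance: 2.0
increment = 0.5 * math.log(2.0 * math.pi * S_k) + v_k ** 2 / (2.0 * S_k)
print(round(increment, 4))  # 1.5155
```

Here the log-determinant term contributes $\tfrac{1}{2}\log(4\pi) \approx 1.2655$ and the quadratic term contributes $0.25$.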
Connections
- The Kalman Filter — supplies the innovations and covariances that form every likelihood term
- The RTS Smoother — supplies smoothing statistics for EM / Fisher’s-identity gradients
- Linear-Gaussian State-Space Models — the parameters being inferred
- State-Space Models and the Kalman Filter - Overview — pipeline context
See Also
- Bayesian Structural Time-Series Model — BSTS marginal likelihood / sampler is built on exactly this prediction-error decomposition
- MCMC Inference for CausalImpact — the Gibbs sampler draws variances/coefficients conditional on states, the complement to integrating states out here
- Spike-and-Slab Prior for Covariate Selection — variable selection relies on tractable marginal likelihoods of competing models
- Single Marketing Time Series — ARIMA likelihoods are evaluated the same way via the Kalman filter