Marginal Likelihood via the Kalman Filter

Summary

Running the Kalman filter does more than estimate the state: as a by-product, and at no extra cost, it evaluates the marginal likelihood $p (y_{1 : T} ∣ θ)$ as a function of the model parameters $θ$ . The key is the prediction-error decomposition: the joint density of the data factors into a product of one-step predictive densities, each a Gaussian in the filter's innovation $v_{k}$ with covariance $S_{k}$ . This replaces a high-dimensional integral over the states with a simple recursive sum, which is what lets you put a prior on $θ$ and run MCMC, MAP optimization, or EM. It is the engine behind parameter inference in BSTS (Bayesian structural time series).

Overview

To learn parameters $θ$ (the entries of $A, H, Q, R, m_{0}, P_{0}$ ) we want the marginal posterior $p (θ ∣ y_{1 : T}) \propto p (y_{1 : T} ∣ θ) p (θ)$ . The states must be integrated out:

$p (θ ∣ y_{1 : T}) = \int p (x_{0 : T}, θ ∣ y_{1 : T}) d x_{0 : T} .$

Evaluating this integral by brute force is impractical because its dimension grows with $T$ . The Markov structure of the state-space model reduces it to a recursion.

Main Content

Theorem: Prediction-error decomposition (Särkkä Thm. 12.1, Eq. 12.5)

The marginal likelihood factors into one-step predictive densities:
$p (y_{1 : T} ∣ θ) = \prod_{k = 1}^{T} p (y_{k} ∣ y_{1 : k - 1}, θ), (12.5)$
where each term is obtained by integrating the measurement model against the predicted state:
$p (y_{k} ∣ y_{1 : k - 1}, θ) = \int p (y_{k} ∣ x_{k}, θ) p (x_{k} ∣ y_{1 : k - 1}, θ) d x_{k} . (12.6)$
The predictive $p (x_{k} ∣ y_{1 : k - 1}, θ)$ is exactly the Kalman prediction step, and the normalizer $Z_{k}$ of the update step is this term — so the filter computes it at no extra cost.
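The decomposition can be checked numerically on a tiny case. The sketch below assumes a scalar local level model ($x_{k} = x_{k - 1} + q_{k - 1}$, $y_{k} = x_{k} + r_{k}$, with $q \sim N (0, Q)$, $r \sim N (0, R)$, $x_{0} \sim N (m_{0}, P_{0})$) and $T = 2$. The function names and numerical values are illustrative and do not come from Särkkä. It compares the sum of the filter's one-step predictive log-densities with the log-density of the joint bivariate Gaussian of $(y_{1}, y_{2})$ written out by hand.

```python
import math

def predictive_logpdfs(ys, Q, R, m0, P0):
    """One-step predictive log-densities log p(y_k | y_{1:k-1}) from a scalar Kalman filter."""
    m, P = m0, P0
    out = []
    for y in ys:
        m_pred, P_pred = m, P + Q            # prediction step (random-walk state)
        v = y - m_pred                       # innovation
        S = P_pred + R                       # innovation variance
        out.append(-0.5 * math.log(2 * math.pi * S) - 0.5 * v * v / S)
        K = P_pred / S                       # Kalman gain
        m, P = m_pred + K * v, P_pred - K * K * S
    return out

def joint_logpdf_2(y1, y2, Q, R, m0, P0):
    """log p(y1, y2) from the explicit bivariate Gaussian of the same model."""
    a = P0 + Q + R          # Var(y1)
    c = P0 + 2 * Q + R      # Var(y2)
    b = P0 + Q              # Cov(y1, y2)
    det = a * c - b * b
    d1, d2 = y1 - m0, y2 - m0
    quad = (c * d1 * d1 - 2 * b * d1 * d2 + a * d2 * d2) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

# Arbitrary illustrative values.
params = dict(Q=0.5, R=2.0, m0=0.1, P0=1.5)
lhs = sum(predictive_logpdfs([0.3, -1.2], **params))
rhs = joint_logpdf_2(0.3, -1.2, **params)
print(lhs, rhs)  # the two numbers should agree up to rounding error
```

The agreement is exact for this model because every density involved is Gaussian; the filter simply organizes the same computation one observation at a time.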

Theorem: Energy function for the linear-Gaussian model (Särkkä Thm. 12.3, Eq. 12.38)

Define the energy function (unnormalized negative log-posterior) $φ_{T} (θ) = - \log p (y_{1 : T} ∣ θ) - \log p (θ)$ . For the linear-Gaussian model it is built recursively alongside the Kalman filter:
$φ_{k} (θ) = φ_{k - 1} (θ) + \frac{1}{2} \log | 2 π S_{k} (θ) | + \frac{1}{2} v_{k}^{T} (θ) S_{k}^{- 1} (θ) v_{k} (θ), (12.38)$
started from $φ_{0} (θ) = - \log p (θ)$ , where $| \cdot |$ denotes the determinant and $v_{k}$ (innovation) and $S_{k}$ (innovation covariance) come straight from the Kalman update step. Each one-step predictive density is the Gaussian
$p (y_{k} ∣ y_{1 : k - 1}, θ) = N (y_{k} ∣ H_{k} m_{k}^{-}, S_{k}), S_{k} = H_{k} P_{k}^{-} H_{k}^{T} + R_{k} . (12.41)$
The marginal parameter posterior is then $p (θ ∣ y_{1 : T}) \propto exp (- φ_{T} (θ))$ .
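
To see where the two per-step terms of the recursion (12.38) come from, take the negative logarithm of the Gaussian predictive density in Eq. (12.41), writing $v_{k} = y_{k} - H_{k} m_{k}^{-}$ for the innovation:
$- \log N (y_{k} ∣ H_{k} m_{k}^{-}, S_{k}) = \frac{1}{2} \log | 2 π S_{k} | + \frac{1}{2} v_{k}^{T} S_{k}^{- 1} v_{k} .$
Summing over $k = 1, \dots, T$ and using the prediction-error decomposition (12.5) gives
$φ_{T} (θ) = - \log p (θ) + \sum_{k = 1}^{T} [ \frac{1}{2} \log | 2 π S_{k} | + \frac{1}{2} v_{k}^{T} S_{k}^{- 1} v_{k} ] ,$
which is the recursion unrolled. Here $| \cdot |$ is the determinant; in the scalar case it reduces to $2 π S_{k}$ .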

Why this makes inference tractable

The intractable integral is gone. The full likelihood is a recursive sum of cheap Gaussian terms — one Kalman pass evaluates $φ_{T} (θ)$ exactly for any $θ$ .

MAP / ML estimates: $\hat{θ}^{MAP} = \arg \min_{θ} φ_{T} (θ)$ (Eq. 12.13); the ML estimate is the same with a flat prior $p (θ) \propto 1$ .

MCMC: Metropolis–Hastings only needs the unnormalized posterior, i.e. $φ_{T} (θ)$ ; the normalizing constant $p (y_{1 : T})$ is never required. The acceptance ratio uses energy differences directly: $α_{i} = \min \{ 1, \exp (φ_{T} (θ^{(i - 1)}) - φ_{T} (θ^{*})) \frac{q (θ^{(i - 1)} ∣ θ^{*})}{q (θ^{*} ∣ θ^{(i - 1)})} \} . (12.17)$ So a single Kalman-filter run per proposal gives a full MCMC sampler over model parameters.
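
A minimal sketch of such a sampler, under loudly stated assumptions: a scalar local level model with known $Q$, unknown $θ = R$, a flat prior on $R > 0$, and a symmetric Gaussian random-walk proposal (so the $q$ ratio in Eq. 12.17 equals 1). The helper names `energy`, `simulate`, and `metropolis`, the step size, and the simulated data are all illustrative choices, not part of the source.

```python
import math
import random

def energy(ys, R, Q=1.0, m0=0.0, P0=1.0):
    """phi_T(R) for the scalar local level model, flat prior on R > 0."""
    if R <= 0.0:
        return math.inf                      # outside the prior support
    m, P, phi = m0, P0, 0.0
    for y in ys:
        P_pred = P + Q
        v = y - m                            # innovation (predicted mean is m)
        S = P_pred + R                       # innovation variance
        phi += 0.5 * math.log(2 * math.pi * S) + 0.5 * v * v / S
        K = P_pred / S
        m, P = m + K * v, P_pred - K * K * S
    return phi

def simulate(T, R_true, Q=1.0, m0=0.0, P0=1.0, seed=0):
    """Draw synthetic observations from the assumed model."""
    rng = random.Random(seed)
    x = rng.gauss(m0, math.sqrt(P0))
    ys = []
    for _ in range(T):
        x += rng.gauss(0.0, math.sqrt(Q))
        ys.append(x + rng.gauss(0.0, math.sqrt(R_true)))
    return ys

def metropolis(ys, n_iter=2000, step=0.3, R_init=1.0, seed=1):
    """Random-walk Metropolis over R; one Kalman pass per proposal."""
    rng = random.Random(seed)
    R, phi = R_init, energy(ys, R_init)
    samples = []
    for _ in range(n_iter):
        R_prop = R + rng.gauss(0.0, step)
        phi_prop = energy(ys, R_prop)
        # Symmetric proposal: accept with probability min(1, exp(phi - phi_prop)).
        if rng.random() < math.exp(min(0.0, phi - phi_prop)):
            R, phi = R_prop, phi_prop
        samples.append(R)
    return samples

ys = simulate(T=100, R_true=1.0)
samples = metropolis(ys)
print(sum(samples) / len(samples))           # crude posterior-mean estimate of R
```

Working with the energy difference rather than the densities themselves avoids underflow, since $\exp (- φ_{T})$ is typically far below floating-point range for realistic $T$.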

EM: Fisher’s identity expresses $\nabla φ_{T}$ as an expectation of the complete-data score under the smoothing distribution — the RTS smoother supplies the required sufficient statistics (Särkkä §12.3, Eq. 12.32).

Laplace approximation: $p (θ ∣ y_{1 : T}) \approx N (θ ∣ \hat{θ}^{MAP}, [H (\hat{θ}^{MAP})]^{- 1})$ , where $H (\hat{θ}^{MAP})$ is the Hessian of $φ_{T}$ evaluated at the MAP estimate (Eq. 12.15), not the measurement matrix.
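
For a single scalar parameter the Laplace approximation can be sketched with a grid search and a central finite difference. This is an illustration under assumptions, not a recommended implementation: it reuses a scalar local level model with known $Q$ and unknown $R$, a flat prior, simulated data, and a grid whose spacing doubles as the finite-difference step. All names are hypothetical.

```python
import math
import random

def energy(ys, R, Q=1.0, m0=0.0, P0=1.0):
    """phi_T(R) for the scalar local level model, flat prior."""
    m, P, phi = m0, P0, 0.0
    for y in ys:
        P_pred = P + Q
        v = y - m
        S = P_pred + R
        phi += 0.5 * math.log(2 * math.pi * S) + 0.5 * v * v / S
        K = P_pred / S
        m, P = m + K * v, P_pred - K * K * S
    return phi

def simulate(T, R_true, Q=1.0, m0=0.0, P0=1.0, seed=0):
    rng = random.Random(seed)
    x = rng.gauss(m0, math.sqrt(P0))
    ys = []
    for _ in range(T):
        x += rng.gauss(0.0, math.sqrt(Q))
        ys.append(x + rng.gauss(0.0, math.sqrt(R_true)))
    return ys

ys = simulate(T=200, R_true=1.0)

dR = 0.01
grid = [dR * i for i in range(1, 501)]       # R in (0, 5]
phis = [energy(ys, R) for R in grid]
i = phis.index(min(phis))                    # grid approximation of the MAP
assert 0 < i < len(grid) - 1, "minimum on the grid boundary; widen the grid"

R_map = grid[i]
# Central second difference of phi_T at the grid minimum approximates the Hessian.
hess = (phis[i + 1] - 2.0 * phis[i] + phis[i - 1]) / dR ** 2
laplace_var = 1.0 / hess                     # variance of the Gaussian approximation
print(R_map, laplace_var)
```

In more than one dimension the same idea needs a numerical optimizer and a full Hessian matrix, which this sketch does not attempt.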

Examples

Posterior of the noise variance in the random walk (Särkkä Ex. 12.1)

For the Gaussian random-walk model, fix the process variance and treat the measurement variance $R$ as the unknown $θ$ . With a flat prior $p (R) \propto 1$ , running the scalar Kalman filter and accumulating $φ_{T} (R)$ from Eq. (12.38) yields $p (R ∣ y_{1 : T}) \propto \exp (- φ_{T} (R))$ . Särkkä's Fig. 12.1 shows that the true $R$ sits in the high-density region, while the MAP/ML point estimate falls below the true value. This illustrates why a full posterior treatment (for example via MCMC) can be preferable to a point estimate.
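
Because $R$ is a single scalar, the posterior can also be tabulated on a grid. The sketch below is in the spirit of this example but does not reproduce Särkkä's data or figure: the simulated series, the known $Q = 1$, the grid range, and the helper names are assumptions made for illustration.

```python
import math
import random

def energy(ys, R, Q=1.0, m0=0.0, P0=1.0):
    """phi_T(R) for the scalar random-walk model with flat prior p(R) ∝ 1."""
    m, P, phi = m0, P0, 0.0
    for y in ys:
        P_pred = P + Q
        v = y - m
        S = P_pred + R
        phi += 0.5 * math.log(2 * math.pi * S) + 0.5 * v * v / S
        K = P_pred / S
        m, P = m + K * v, P_pred - K * K * S
    return phi

def simulate(T, R_true, Q=1.0, m0=0.0, P0=1.0, seed=0):
    rng = random.Random(seed)
    x = rng.gauss(m0, math.sqrt(P0))
    ys = []
    for _ in range(T):
        x += rng.gauss(0.0, math.sqrt(Q))
        ys.append(x + rng.gauss(0.0, math.sqrt(R_true)))
    return ys

ys = simulate(T=100, R_true=1.0)

dR = 0.01
grid = [dR * i for i in range(1, 501)]            # R in (0, 5]
phis = [energy(ys, R) for R in grid]
phi_min = min(phis)

# Subtract the minimum before exponentiating to avoid underflow.
weights = [math.exp(-(p - phi_min)) for p in phis]
norm = sum(weights) * dR                          # Riemann-sum normalizer
density = [w / norm for w in weights]             # approximates p(R | y_{1:T}) on the grid

R_map = grid[phis.index(phi_min)]                 # grid MAP (= ML under the flat prior)
R_mean = sum(R * d for R, d in zip(grid, density)) * dR
print(R_map, R_mean)
```

Comparing `R_map` with `R_mean` and with the spread of `density` gives a quick sense of how much a point estimate hides; the actual numbers depend on the simulated data.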

Worked energy step (local level model)

Per time step, with scalar innovation $v_{k} = y_{k} - m_{k}^{-}$ and $S_{k} = P_{k}^{-} + R$ :
$φ_{k} = φ_{k - 1} + \frac{1}{2} \log (2 π S_{k}) + \frac{v_{k}^{2}}{2 S_{k}} .$
Summing over $k$ and adding $- \log p (θ)$ gives the negative log-posterior used inside an MCMC loop.
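
The per-step update translates almost line for line into code. The sketch below assumes the local level model with state noise variance `Q` and measurement noise variance `R`; the function names `energy_step` and `neg_log_posterior` and the `log_prior` argument are hypothetical conveniences, not notation from the source.

```python
import math

def energy_step(phi_prev, m, P, y, Q, R):
    """One step of the scalar energy recursion plus the Kalman update it relies on."""
    P_pred = P + Q                      # predicted variance P_k^- (predicted mean is m)
    v = y - m                           # innovation v_k = y_k - m_k^-
    S = P_pred + R                      # innovation variance S_k = P_k^- + R
    phi = phi_prev + 0.5 * math.log(2 * math.pi * S) + v * v / (2 * S)
    K = P_pred / S                      # Kalman gain
    return phi, m + K * v, P_pred - K * K * S

def neg_log_posterior(ys, Q, R, m0=0.0, P0=1.0, log_prior=0.0):
    """phi_T: start from phi_0 = -log p(theta) and add one term per observation."""
    phi, m, P = -log_prior, m0, P0
    for y in ys:
        phi, m, P = energy_step(phi, m, P, y, Q, R)
    return phi

# Single hand-checkable step: m0 = 0, P0 = 1, Q = R = 1, y = 1
# gives P^- = 2, S = 3, v = 1, so phi_1 = 0.5 * log(6 * pi) + 1 / 6.
phi1, m1, P1 = energy_step(0.0, 0.0, 1.0, 1.0, 1.0, 1.0)
print(phi1, m1, P1)
```

Keeping the step as its own function makes it easy to see that the likelihood term needs nothing beyond the quantities the filter already computes.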

Connections

The Kalman Filter — supplies the innovations $v_{k}$ and covariances $S_{k}$ that form every likelihood term
The RTS Smoother — supplies smoothing statistics for EM / Fisher’s-identity gradients
Linear-Gaussian State-Space Models — the parameters $θ = (A, H, Q, R, \dots)$ being inferred
State-Space Models and the Kalman Filter - Overview — pipeline context
