Expected Information Gain

Summary

The expected information gain (EIG) is the objective function of Bayesian experimental design: the expected reduction in entropy (uncertainty) about the latent variable $θ$ from running an experiment with design $ξ$, averaged over not-yet-observed outcomes $y$. It equals the mutual information $MI_{ξ}(θ; y)$. The Bayesian optimal design is $ξ^{*} = \arg\max_{ξ} EIG(ξ)$. This note gives the four equivalent forms of the EIG and explains why it is hard to compute.

Overview

We hold a prior $p(θ)$ over a latent quantity of interest $θ$ (model parameters, a function optimum, a future prediction, or anything else) and a model $p(y ∣ θ, ξ)$ for the outcome $y$ of an experiment run under design $ξ$. After observing $y$, Bayes' rule updates us to the posterior $p(θ ∣ y, ξ) \propto p(θ) p(y ∣ θ, ξ)$.

The information gain of a particular realized outcome is the drop in Shannon entropy from prior to posterior. Before running the experiment we do not know $y$, so to score a design we take the expectation over outcomes, which gives the EIG.

Main Content

Definition: Information Gain (Rainforth 2023, Eq. 1)

For a hypothetical outcome $y$ under design $ξ$, the information gain in $θ$ is the reduction in Shannon entropy $H[\cdot]$ from the prior to the posterior:
$InfoGain_{θ}(ξ, y) := H[p(θ)] - H[p(θ ∣ y, ξ)] = E_{p(θ ∣ y, ξ)}[\log p(θ ∣ y, ξ)] - E_{p(θ)}[\log p(θ)]$
Because $y$ is unknown at design time, this cannot be optimized directly.
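
To make the definition concrete, here is a minimal sketch on a hypothetical discrete toy model (the prior, the likelihood table, and all names are illustrative assumptions, not taken from Rainforth 2023). It computes the information gain of each possible outcome and suggests that a single realized outcome can increase entropy, even though the average over outcomes stays non-negative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # treat 0 * log 0 as 0
    return -np.sum(p[nz] * np.log(p[nz]))

# Hypothetical toy model: theta in {0, 1, 2}, binary outcome y, one fixed design.
prior = np.array([0.8, 0.1, 0.1])      # p(theta)
lik = np.array([[0.9, 0.1],            # p(y | theta = 0, xi)
                [0.2, 0.8],            # p(y | theta = 1, xi)
                [0.1, 0.9]])           # p(y | theta = 2, xi)

def info_gain(y):
    """H[prior] - H[posterior] for one realized outcome y."""
    joint = prior * lik[:, y]
    posterior = joint / joint.sum()    # Bayes' rule
    return entropy(prior) - entropy(posterior)

marginal = prior @ lik                 # p(y | xi)
gains = np.array([info_gain(0), info_gain(1)])
eig = marginal @ gains                 # average gain over outcomes

print("per-outcome gains:", gains)
print("expected gain:", eig)
```

In this toy model the surprising outcome `y = 1` spreads the posterior over all three latent values, so its information gain is negative; only the expectation over outcomes is guaranteed to be non-negative.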

Definition: Expected Information Gain (Rainforth 2023, Eqs. 2–3)

The EIG averages information gain over outcomes via the marginal predictive $p(y ∣ ξ) := E_{p(θ)}[p(y ∣ θ, ξ)]$:
$EIG(ξ) := E_{p(y ∣ ξ)}[InfoGain_{θ}(ξ, y)] = E_{p(θ) p(y ∣ θ, ξ)}[\log p(θ ∣ y, ξ) - \log p(θ)]$
The Bayesian optimal design is $ξ^{*} := \arg\max_{ξ \in Ξ} EIG(ξ)$.
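
As a sanity check on the definition, consider a standard closed-form case (an illustrative assumption, not an example from the source): a linear-Gaussian model with prior $θ \sim N(0, σ_{θ}^{2})$ and outcome $y ∣ θ, ξ \sim N(ξ θ, σ^{2})$. The marginal predictive is then $p(y ∣ ξ) = N(0, ξ^{2} σ_{θ}^{2} + σ^{2})$, and by the symmetry of mutual information the EIG is a difference of two Gaussian entropies:
$EIG(ξ) = H[p(y ∣ ξ)] - E_{p(θ)}[H[p(y ∣ θ, ξ)]] = \frac{1}{2} \log(2 π e (ξ^{2} σ_{θ}^{2} + σ^{2})) - \frac{1}{2} \log(2 π e σ^{2}) = \frac{1}{2} \log\left(1 + \frac{ξ^{2} σ_{θ}^{2}}{σ^{2}}\right)$
The EIG grows with the signal-to-noise ratio $ξ^{2} σ_{θ}^{2} / σ^{2}$, so over a design space such as $Ξ = [-1, 1]$ the optimum sits at the boundary, $ξ^{*} = \pm 1$.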

Equivalent forms of the EIG (mutual information)

Together with the entropy-reduction definition above, the EIG can be written in four equivalent ways, each suggesting a different estimator. The other three follow from writing the joint as $p(θ, y ∣ ξ) = p(θ) p(y ∣ θ, ξ)$:
$EIG(ξ) = E_{p(y, θ ∣ ξ)}\left[\log \frac{p(θ ∣ y, ξ)}{p(θ)}\right] = E_{p(y, θ ∣ ξ)}\left[\log \frac{p(θ, y ∣ ξ)}{p(θ) p(y ∣ ξ)}\right] = E_{p(y, θ ∣ ξ)}\left[\log \frac{p(y ∣ θ, ξ)}{p(y ∣ ξ)}\right]$
The middle form shows that the EIG is the mutual information $MI_{ξ}(θ; y)$ between latent and outcome. The right ("likelihood") form is convenient when $\dim(y) ≪ \dim(θ)$, because its unknown term $p(y ∣ ξ)$ is then a low-dimensional density; the left ("posterior") form is convenient when $\dim(θ) ≪ \dim(y)$, for the same reason applied to $p(θ ∣ y, ξ)$.
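
The equivalence can be checked numerically when everything is enumerable. The sketch below uses a hypothetical three-state, two-outcome model (the numbers are arbitrary assumptions) and evaluates the entropy-reduction, posterior, mutual-information, and likelihood forms exactly.

```python
import numpy as np

# Hypothetical toy model: 3 latent values, 2 outcomes, one fixed design xi.
prior = np.array([0.5, 0.3, 0.2])            # p(theta)
lik = np.array([[0.9, 0.1],                  # p(y | theta, xi), rows sum to 1
                [0.5, 0.5],
                [0.2, 0.8]])

joint = prior[:, None] * lik                 # p(theta, y | xi)
marginal = joint.sum(axis=0)                 # p(y | xi)
posterior = joint / marginal                 # p(theta | y, xi), columns sum to 1

# Entropy-reduction form: H[prior] - E_y H[posterior]
entropy_form = -np.sum(prior * np.log(prior)) + np.sum(joint * np.log(posterior))
# Posterior form: E[log p(theta | y) / p(theta)]
posterior_form = np.sum(joint * np.log(posterior / prior[:, None]))
# Mutual-information form: E[log p(theta, y) / (p(theta) p(y))]
mi_form = np.sum(joint * np.log(joint / (prior[:, None] * marginal)))
# Likelihood form: E[log p(y | theta) / p(y)]
likelihood_form = np.sum(joint * np.log(lik / marginal))

forms = np.array([entropy_form, posterior_form, mi_form, likelihood_form])
print(forms)
```

All four numbers should coincide up to floating-point error; the forms differ only in which density has to be approximated once exact enumeration is no longer possible.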

Why the EIG is hard: double intractability

Every form contains an intractable normalizing density:

the posterior $p (θ ∣ y, ξ)$ (left form), and/or
the marginal likelihood $p (y ∣ ξ)$ (right form),

neither of which is generally available in closed form. A naive Monte Carlo estimator of, e.g., the likelihood form,

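$EIG(ξ) \approx \frac{1}{N} \sum_{n=1}^{N} \left[\log p(y_{n} ∣ θ_{n}, ξ) - \log p(y_{n} ∣ ξ)\right], \quad (θ_{n}, y_{n}) \sim p(θ) p(y ∣ θ, ξ),$

cannot be evaluated as written because each $\log p(y_{n} ∣ ξ)$ is itself an intractable integral over $θ$. This makes the EIG a nested (doubly intractable) expectation requiring nested estimation, whose conventional nested Monte Carlo estimators converge slowly: the error decays as $O(T^{-1/3})$ in the total number of likelihood evaluations $T$, compared with $O(T^{-1/2})$ for ordinary Monte Carlo. Overcoming this is the entire technical program of variational EIG estimation and the unified gradient approach.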
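
A minimal sketch of the nested Monte Carlo idea, assuming a hypothetical linear-Gaussian model ($θ \sim N(0, 1)$, $y ∣ θ, ξ \sim N(ξ θ, 1)$) chosen only because its EIG is known in closed form. The function names and sample sizes are illustrative assumptions, not a reference implementation. Each outer sample needs its own inner average to approximate $p(y_{n} ∣ ξ)$, which is where the cost multiplies.

```python
import numpy as np

def exact_eig(xi, prior_sd=1.0, noise_sd=1.0):
    """Closed-form EIG of the linear-Gaussian toy model: 0.5 * log(1 + SNR)."""
    return 0.5 * np.log1p((xi * prior_sd / noise_sd) ** 2)

def nmc_eig(xi, n_outer=4000, n_inner=200, prior_sd=1.0, noise_sd=1.0, seed=0):
    """Nested Monte Carlo estimate of the likelihood form of the EIG."""
    rng = np.random.default_rng(seed)

    def log_lik(y, theta):
        # log N(y; xi * theta, noise_sd^2)
        z = (y - xi * theta) / noise_sd
        return -0.5 * np.log(2.0 * np.pi * noise_sd**2) - 0.5 * z**2

    # Outer samples (theta_n, y_n) from the joint p(theta) p(y | theta, xi).
    theta = rng.normal(0.0, prior_sd, size=n_outer)
    y = xi * theta + rng.normal(0.0, noise_sd, size=n_outer)

    # Inner samples: fresh prior draws used to estimate p(y_n | xi) for each n.
    theta_inner = rng.normal(0.0, prior_sd, size=(n_outer, n_inner))
    inner = log_lik(y[:, None], theta_inner)          # shape (n_outer, n_inner)

    # Numerically stable log-mean-exp over the inner dimension.
    m = inner.max(axis=1, keepdims=True)
    log_marginal = m[:, 0] + np.log(np.mean(np.exp(inner - m), axis=1))

    return np.mean(log_lik(y, theta) - log_marginal)

print("nested MC:", nmc_eig(1.0))
print("exact:    ", exact_eig(1.0))
```

The total cost here is `n_outer * n_inner` likelihood evaluations. Because the logarithm of an averaged likelihood underestimates $\log p(y_{n} ∣ ξ)$ in expectation (Jensen's inequality), the estimate is biased upward for any finite `n_inner`.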

Decision-theoretic reading

The EIG is the expected utility of an experiment when the utility is the information (log-score) utility $U(ξ, θ, y) = \log p(θ ∣ y, ξ) - \log p(θ)$. Dropping the prior term and using $U = \log p(θ ∣ y, ξ)$ shifts the expected utility only by the prior entropy, which does not depend on $ξ$, so the optimal design is unchanged. More general BED replaces this with any utility that is a functional of the posterior (Bernardo 1979; Chaloner & Verdinelli 1995), but the KL/entropy utility is the most common and typically best-performing choice. See Information-Theoretic Design Objectives and Decision Analysis.

Examples

Discrete-outcome (Rao–Blackwellized) EIG

When $y$ takes a small number of discrete values $Y$, the expectation over outcomes can be enumerated rather than sampled, and the same prior samples $θ_{n}$ are reused to estimate the marginal, giving a lower-variance estimator (Rainforth 2023, Eq. 6):
$\hat{μ}_{N} := \sum_{y \in Y} \left[\frac{1}{N} \sum_{n=1}^{N} p(y ∣ θ_{n}, ξ) \log p(y ∣ θ_{n}, ξ) - \hat{p}(y ∣ ξ) \log \hat{p}(y ∣ ξ)\right], \quad \hat{p}(y ∣ ξ) = \frac{1}{N} \sum_{n=1}^{N} p(y ∣ θ_{n}, ξ), \quad θ_{n} \sim p(θ).$
This is exactly the trick used for the death-process experiment in High-Dimensional Design Applications.
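
The estimator can be sketched in a few lines. The model below, a standard normal prior with a Bernoulli outcome whose success probability is `sigmoid(xi * theta)`, is a hypothetical stand-in chosen for brevity; it is not the death-process model, and `rb_eig` is an illustrative name.

```python
import numpy as np

def rb_eig(xi, n=5000, seed=0):
    """Rao-Blackwellized EIG estimate for a binary outcome y in {0, 1}."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n)                     # theta_n ~ p(theta) = N(0, 1)

    # Hypothetical likelihood: p(y = 1 | theta, xi) = sigmoid(xi * theta).
    p1 = 1.0 / (1.0 + np.exp(-xi * theta))
    lik = np.stack([1.0 - p1, p1], axis=1)         # shape (n, 2): p(y | theta_n, xi)
    lik = np.clip(lik, 1e-12, 1.0)                 # guard against log(0)

    p_hat = lik.mean(axis=0)                       # \hat{p}(y | xi) for each y

    # sum_y [ (1/N) sum_n p log p  -  p_hat log p_hat ]
    cond_term = np.mean(np.sum(lik * np.log(lik), axis=1))
    marg_term = np.sum(p_hat * np.log(p_hat))
    return cond_term - marg_term

for xi in (0.0, 0.5, 2.0):
    print(xi, rb_eig(xi))
```

No outcomes $y$ are sampled: the only Monte Carlo noise comes from the $θ_{n}$ draws. For a binary outcome the true EIG cannot exceed $\log 2$ nats, which gives a quick plausibility check on the output.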

Connections

Generalizes to sequential settings via the incremental EIG conditioned on history — see Sequential and Adaptive BED.
Is a mutual information, so any MI lower/upper bound (Barber–Agakov, InfoNCE/PCE, NWJ, MINE) becomes an EIG estimator — the basis of Variational BOED - Overview and Adaptive Contrastive Estimation (ACE).
Contrasts with the Fisher-information / alphabetic-optimality criteria of classical design, which approximate or replace the EIG — see Information-Theoretic Design Objectives.
