Lindley’s Information Measure

Summary

Lindley (1956), On a Measure of the Information Provided by an Experiment (Annals of Mathematical Statistics). The founding paper of Bayesian experimental design. Adapting Shannon’s information theory to statistics, Lindley defines the average information provided by an experiment as the expected gain in information about the parameter $θ$ from prior to posterior (equivalently, the expected reduction in Shannon entropy). This is exactly the modern expected information gain (EIG), i.e. the mutual information $I(θ; x)$. He proves it is non-negative (Thm 1) and additive (Thm 2), shows that it is preserved by sufficient statistics, establishes diminishing-returns and concavity results for repeated sampling (Thms 3–6), defines a prior-free partial order on experiments (more informative for all priors), relates it to Blackwell’s ordering (Thm 9), and applies it to design: perform the experiment of greatest expected information, and continue until a preassigned amount is attained. The applications recover the Wald SPRT and a determinant (D-optimality) criterion.

Overview

Shannon introduced two ideas: that information is a statistical concept (defined relative to a frequency distribution), and that there is an essentially unique functional of that distribution measuring its information content. Lindley’s purpose is to apply these to an experiment rather than a message: replace the transmitted message $x$ by knowledge of parameters $θ$ before an experiment, and the received message by knowledge after. Comparing the two quantifies the information the experiment provides.

Crucially, Lindley argues that prior distributions — “though usually anathema to the statistician” — are essential to the notion of experimental information: if the prior is concentrated on a single $θ$ (the state of nature is known), no experiment can be informative. This makes the paper a foundational Bayesian document.

Main Content

The measure

Lindley works with an experiment $E = \{X, B, Θ, P\}$: outcome space $X$, parameter space $Θ$, and densities $p(x ∣ θ)$. With prior $p(θ)$, the marginal is $p(x) = \int p(x ∣ θ) p(θ) \, dθ$ (Eq. 1), and Bayes’ theorem gives $p(θ ∣ x) = p(x ∣ θ) p(θ) / p(x)$ (Eq. 2).

Definition: Information of a distribution (Lindley 1956, Eqs. 3–5)

The amount of information in the prior is $I_{0} = \int p(θ) \log p(θ) \, dθ = E_{θ} \log p(θ)$, and after observing $x$ the information in the posterior is $I_{1}(x) = \int p(θ ∣ x) \log p(θ ∣ x) \, dθ$. Sign convention (important): Lindley uses $+\int p \log p$, the negative of Shannon entropy, so that a distribution concentrated on one value has maximum information and a spread distribution has less. This deliberately reverses the communications engineer’s scale.

Definition 1 & 2: Information provided by an experiment (Lindley 1956, Eqs. 6–7)

The information provided by experiment $E$ with prior $p (θ)$ when the observation is $x$ is
$I (E, p (θ), x) = I_{1} (x) - I_{0} .$
The average information provided by the experiment (Definition 2) averages over outcomes via $p (x)$ :
$I(E, p(θ)) = E_{x} [I_{1}(x) - I_{0}] = E_{x} \left[ \int p(θ ∣ x) \log p(θ ∣ x) \, dθ \right] - \int p(θ) \log p(θ) \, dθ .$
This is exactly the modern expected information gain $H [p (θ)] - E_{x} H [p (θ ∣ x)]$ — see Expected Information Gain.

Equivalent forms (Lindley 1956, Eqs. 8–11)

The average information has several equivalent expressions, including the symmetric joint form
$I(E, p(θ)) = \iint p(x, θ) \log \frac{p(x, θ)}{p(x) p(θ)} \, dx \, dθ = E_{θ} E_{x} \log \frac{p(θ ∣ x)}{p(θ)} = E_{x} E_{θ} \log \frac{p(x ∣ θ)}{p(x)},$
which manifestly shows the symmetry between $x$ and $θ$ (it is the mutual information $I (θ; x)$ ) and its invariance under one-to-one reparameterization of $Θ$ . Lindley notes this expression “occurs in Shannon’s theory for the rate of transmission along a channel.” (The single-outcome $I_{1} (x) - I_{0}$ is not invariant; the average is.)
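The equivalence of the posterior-minus-prior form and the symmetric joint form can be checked numerically on a small discrete experiment. The sketch below is only an illustration under stated assumptions: the three-point prior and the likelihood table are made-up numbers, the function names are hypothetical, and sums replace Lindley’s integrals.

```python
import numpy as np

def information(p):
    """Lindley's information of a discrete distribution: sum p log p (0 log 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

def average_information_posterior_form(prior, likelihood):
    """E_x[I_1(x)] - I_0, with likelihood[k, j] = p(x_j | theta_k)."""
    joint = prior[:, None] * likelihood        # p(x, theta)
    marginal = joint.sum(axis=0)               # p(x), assumed strictly positive
    posterior = joint / marginal               # column j is p(theta | x_j)
    i1 = np.array([information(posterior[:, j]) for j in range(len(marginal))])
    return float(marginal @ i1 - information(prior))

def average_information_joint_form(prior, likelihood):
    """Double sum of p(x, theta) log[p(x, theta) / (p(x) p(theta))]."""
    joint = prior[:, None] * likelihood
    marginal = joint.sum(axis=0)
    product = prior[:, None] * marginal[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Illustrative (made-up) prior and likelihood table.
prior = np.array([0.5, 0.3, 0.2])
likelihood = np.array([[0.7, 0.2, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.1, 0.1, 0.8]])

posterior_form = average_information_posterior_form(prior, likelihood)
joint_form = average_information_joint_form(prior, likelihood)
print(posterior_form, joint_form)  # the two forms should agree up to rounding
```

Both functions return values in nats because natural logarithms are used; switching to `np.log2` would give bits.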

The core theorems

Theorem 1 — Non-negativity (any experiment is informative on average)

$I(E) \geq 0$, with equality iff $p(x ∣ θ)$ does not depend on $θ$ (except on a null set). Proof: a convexity (Jensen) inequality on $x \log x$. Interpretation: provided the outcome density varies with $θ$, the experiment is informative on average, though a particular surprising $x$ can reduce information ($I_{1}(x) - I_{0} < 0$ is possible).
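A small two-state example makes the distinction between the single-outcome gain and the average gain concrete. The numbers below are assumptions chosen for illustration: a prior that strongly favours the first state, and an outcome that points the other way and leaves the posterior at 50/50.

```python
import numpy as np

def information(p):
    """Lindley's information sum p log p of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

# Illustrative numbers: confident prior, fairly accurate binary experiment.
prior = np.array([0.9, 0.1])
likelihood = np.array([[0.9, 0.1],    # p(x | theta_1)
                       [0.1, 0.9]])   # p(x | theta_2)

joint = prior[:, None] * likelihood
marginal = joint.sum(axis=0)          # p(x)
posterior = joint / marginal          # column j is p(theta | x_j)

# Single-outcome gains I_1(x) - I_0, one per outcome.
gains = np.array([information(posterior[:, j]) - information(prior)
                  for j in range(2)])
average_gain = float(marginal @ gains)

print(gains)         # the surprising outcome (index 1) has a negative gain
print(average_gain)  # the average over outcomes is still positive
```

Here the second outcome moves the posterior from (0.9, 0.1) to (0.5, 0.5), so information is lost for that outcome even though the experiment is informative on average.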

Theorem 2 — Additivity / chain rule (Lindley 1956, Eqs. 12–13)

For an experiment yielding $x = (x_{1}, x_{2})$ , $I (E_{1}) + I (E_{2} ∣ E_{1}) = I (E)$ , where $I (E_{2} ∣ E_{1})$ is the average information from $x_{2}$ after $E_{1}$ has been performed and $x_{1}$ observed. This is the additivity postulate Shannon required, and underlies the additivity of incremental EIGs in sequential design. Corollary (sufficiency): if $x_{1}$ is sufficient for $x$ (Neyman–Fisher), then $I (E_{1}) = I (E)$ — no information is lost by reducing to a sufficient statistic; an insufficient statistic strictly loses information.
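The chain rule can be verified numerically for a discrete experiment. The following is a minimal sketch, assuming two binary observations that are conditionally independent given $θ$; the probability tables and function name are hypothetical. $I(E_{2} ∣ E_{1})$ is computed as the average over $x_{1}$ of the information from $x_{2}$ when the prior is replaced by the posterior $p(θ ∣ x_{1})$.

```python
import numpy as np

def mutual_information(prior, likelihood):
    """Average information I(E, p(theta)) in nats; likelihood[k, j] = p(x_j | theta_k)."""
    joint = prior[:, None] * likelihood
    marginal = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (prior[:, None] * marginal)[mask])))

# Illustrative tables (assumed values).
prior = np.array([0.3, 0.7])
lik1 = np.array([[0.8, 0.2], [0.4, 0.6]])   # p(x1 | theta)
lik2 = np.array([[0.6, 0.4], [0.1, 0.9]])   # p(x2 | theta), independent of x1 given theta

# Joint experiment E: the pair (x1, x2) flattened into 4 outcome columns.
lik_joint = (lik1[:, :, None] * lik2[:, None, :]).reshape(2, 4)
i_total = mutual_information(prior, lik_joint)

# I(E1), then I(E2 | E1) as an average over x1 of the gain under p(theta | x1).
i_first = mutual_information(prior, lik1)
p_x1 = prior @ lik1
posterior_given_x1 = (prior[:, None] * lik1) / p_x1   # column j is p(theta | x1 = j)
i_second_given_first = sum(
    p_x1[j] * mutual_information(posterior_given_x1[:, j], lik2) for j in range(2)
)

print(i_first + i_second_given_first, i_total)  # should agree up to rounding
```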

Theorems 3–6 — Diminishing returns and concavity

Thm 3: for independent experiments, $I (E_{2} ∣ E_{1}) \leq I (E_{2})$ — an independent repeat is less informative performed second than first (diminishing marginal utility of equidistributed observations). Hence $I (E_{1}) + I (E_{2}) \geq I (E)$ for independent experiments.

Thm 4: for $n$ independent identical repetitions, $j_{n} := I (E^{(n)})$ is a concave, increasing function of $n$ .

Thm 5: $I (E, p (θ))$ is a concave function of the prior $p (θ)$ .

Thm 6: for a mixture experiment (sample from $λ p_{1} + (1 - λ) p_{2}$ ), $I (E) \leq λ I (E_{1}) + (1 - λ) I (E_{2})$ — convex in the outcome densities: it is better to take a fixed sample size than to “mix” sample sizes of the same average.
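Theorems 5 and 6 can be spot-checked on random discrete experiments. This is a sketch under assumptions: priors and outcome tables are drawn from Dirichlet distributions with a fixed seed, the mixing weight is arbitrary, and a check on one random draw illustrates the inequalities but proves nothing.

```python
import numpy as np

def mutual_information(prior, likelihood):
    """Average information I(E, p(theta)) in nats; likelihood[k, j] = p(x_j | theta_k)."""
    joint = prior[:, None] * likelihood
    marginal = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (prior[:, None] * marginal)[mask])))

rng = np.random.default_rng(0)
lam = 0.3                                   # arbitrary mixing weight
prior_a = rng.dirichlet(np.ones(3))
prior_b = rng.dirichlet(np.ones(3))
lik_a = rng.dirichlet(np.ones(4), size=3)   # rows are p(x | theta)
lik_b = rng.dirichlet(np.ones(4), size=3)

# Thm 5: concavity in the prior (fixed experiment lik_a).
mixed_prior = lam * prior_a + (1 - lam) * prior_b
concavity_gap = mutual_information(mixed_prior, lik_a) - (
    lam * mutual_information(prior_a, lik_a)
    + (1 - lam) * mutual_information(prior_b, lik_a)
)

# Thm 6: convexity in the outcome densities (fixed prior prior_a).
mixed_lik = lam * lik_a + (1 - lam) * lik_b
convexity_gap = (
    lam * mutual_information(prior_a, lik_a)
    + (1 - lam) * mutual_information(prior_a, lik_b)
) - mutual_information(prior_a, mixed_lik)

print(concavity_gap, convexity_gap)  # both gaps should be non-negative
```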

Comparing experiments without a prior (the partial order)

Definition 4 — "More informative (S)" (Lindley 1956, Eq. 17)

$E_{1}$ is more informative than $E_{2}$ (written $E_{1} > E_{2}$) if $I(E_{1}, p(θ)) \geq I(E_{2}, p(θ))$ for all priors $p(θ)$, with strict inequality for some prior. This is only a partial order: some pairs of experiments are not comparable, because which of the two is more informative depends on the prior.

Theorem 9 — Relation to Blackwell's ordering

If $E_{1}$ is sufficient for $E_{2}$ in Blackwell’s sense ( $E_{1} \supset E_{2}$ — an experimenter with $E_{1}$ can reproduce $E_{2}$ by a random device), then $E_{1}$ is not less informative (S): $E_{1} \geq E_{2}$ . So Blackwell’s ordering implies Lindley’s, but the converse is false (demonstrated by the binomial-dichotomy example, Fig. 1: there are experiments more informative in Lindley’s sense that are not Blackwell-comparable). Lindley’s ordering is coarser — more pairs are comparable — which he views as a “satisfactory feature.”
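The direction “Blackwell implies Lindley” can be illustrated for discrete experiments, where Blackwell sufficiency means that the outcome table of $E_{2}$ is that of $E_{1}$ multiplied by a row-stochastic “garbling” matrix. The sketch below uses assumed tables and checks the inequality only on a random sample of priors, so it is an illustration and not a proof.

```python
import numpy as np

def mutual_information(prior, likelihood):
    """Average information I(E, p(theta)) in nats; likelihood[k, j] = p(x_j | theta_k)."""
    joint = prior[:, None] * likelihood
    marginal = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (prior[:, None] * marginal)[mask])))

# E1: two parameter values, three outcomes (assumed values).
lik1 = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.3, 0.6]])

# Random device: each row is the distribution of the E2 outcome given the E1 outcome.
garble = np.array([[0.9, 0.1],
                   [0.5, 0.5],
                   [0.2, 0.8]])

# E2 is reproduced from E1 by the random device, so E1 is Blackwell-sufficient for E2.
lik2 = lik1 @ garble

rng = np.random.default_rng(1)
priors = rng.dirichlet(np.ones(2), size=200)
gaps = np.array([mutual_information(p, lik1) - mutual_information(p, lik2)
                 for p in priors])

print(gaps.min())  # expected to be non-negative for every sampled prior
```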

Lindley contrasts his information ordering with the decision-theoretic comparisons of Blackwell and of Bohnenblust–Shapley–Sherman (loss-function based), arguing that gaining knowledge about nature is a legitimate purpose of experimentation distinct from reaching decisions.

The design principle

The Lindley design rule (§2, §6)

“Perform that experiment for which the expected gain in information is the greatest, and continue experimentation until a preassigned amount of information has been attained.”

This is the seed of all of Bayesian optimal experimental design: maximize the EIG to choose designs, and use an information threshold as a stopping rule for sequential experiments.
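The rule can be sketched as a greedy loop for a discrete parameter: at each step pick the candidate experiment with the greatest expected gain under the current belief, observe, update, and stop once the posterior information reaches the preassigned level. Everything specific here is an assumption made for illustration: the two candidate experiments, the threshold, the simulated “true” parameter, and the function names. Choosing one step ahead is a simplification of the general sequential problem.

```python
import numpy as np

def information(p):
    """Lindley's information sum p log p of a discrete distribution (at most 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

def expected_gain(belief, likelihood):
    """Average information of one experiment under the current belief."""
    joint = belief[:, None] * likelihood
    marginal = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (belief[:, None] * marginal)[mask])))

def run_lindley_rule(prior, candidates, true_theta, target, rng, max_steps=200):
    """Greedy one-step-ahead design with an information threshold as stopping rule."""
    belief = np.array(prior, dtype=float)
    chosen = []
    while information(belief) < target and len(chosen) < max_steps:
        # Perform the experiment whose expected gain in information is greatest.
        name = max(candidates, key=lambda k: expected_gain(belief, candidates[k]))
        likelihood = candidates[name]
        # Simulate an outcome from the (hypothetical) true parameter value.
        x = rng.choice(likelihood.shape[1], p=likelihood[true_theta])
        belief = belief * likelihood[:, x]
        belief = belief / belief.sum()      # Bayes update
        chosen.append(name)
    return belief, chosen

candidates = {
    "weak":   np.array([[0.6, 0.4], [0.4, 0.6]]),
    "strong": np.array([[0.9, 0.1], [0.1, 0.9]]),
}
prior = np.array([0.5, 0.5])
rng = np.random.default_rng(0)
belief, chosen = run_lindley_rule(prior, candidates, true_theta=0, target=-0.05, rng=rng)
print(chosen, belief)
```

In this toy setup the "weak" experiment is a noisier version of the "strong" one, so the greedy rule selects "strong" at every step; with candidates that are not ordered in this way the choice would depend on the current belief.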

Examples

Normal experiment — smaller variance is more informative (§6, Eq. 19)

For $E (σ)$ with $x \sim N (θ, σ^{2})$ , Lindley shows $E (σ_{1}) > E (σ_{2})$ when $σ_{1} < σ_{2}$ (smaller-variance experiment more informative for all priors), via the stochastic transformation $x_{2}^{'} = x_{1} + u$ . With a normal prior $N (μ, τ^{2})$ :
$I(E(σ), p_{τ}) = \frac{1}{2} \log (1 + τ^{2} / σ^{2}), \qquad j_{n} = \frac{1}{2} \log (1 + n τ^{2} / σ^{2}) .$
The $n$-sample formula illustrates Theorem 4 (concave, increasing). Notably, $j_{n}$ grows without limit but only logarithmically in $n$, in contrast with the usual precision measure $n / σ^{2}$, which grows linearly.
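The closed form for $j_{n}$ can be cross-checked against the definition. For a normal distribution with variance $v$, $\int p \log p = -\frac{1}{2} \log (2 π e v)$, and the posterior variance after $n$ observations is $(1/τ^{2} + n/σ^{2})^{-1}$ whatever the data, so the gain is the same for every outcome. The sketch below uses assumed values of $σ$ and $τ$ and hypothetical function names.

```python
import numpy as np

def normal_information_gain(n, sigma, tau):
    """j_n = 0.5 * log(1 + n tau^2 / sigma^2), in nats."""
    return 0.5 * np.log1p(n * tau**2 / sigma**2)

def gaussian_information(variance):
    """Lindley's information (integral of p log p) of a normal density."""
    return -0.5 * np.log(2 * np.pi * np.e * variance)

sigma, tau = 2.0, 1.0                     # assumed values for illustration
n = np.arange(0, 11)

j = normal_information_gain(n, sigma, tau)

# Same quantity as posterior information minus prior information.
posterior_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
j_from_definition = gaussian_information(posterior_var) - gaussian_information(tau**2)

first_differences = np.diff(j)            # positive: j_n is increasing
second_differences = np.diff(j, 2)        # negative: j_n is concave

print(j)
print(first_differences)
```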

Multivariate normal — the determinant (D-optimality) criterion (§6)

For $k$ -dimensional $x \sim N (θ, C)$ with prior $θ \sim N (μ, A)$ :
$I(E(C), p_{A}) = \frac{1}{2} \log \frac{|A + C|}{|C|} .$
Under near-ignorance (prior dispersion $A$ large, so that $A^{-1} C$ is small), $|A + C| \approx |A|$ and the comparison of $E(C_{1})$ with $E(C_{2})$ reduces approximately to a comparison of determinants: $E(C_{1})$ is the more informative when $|C_{2}| > |C_{1}|$. This is design via the determinant of the dispersion matrix, the criterion later known as D-optimality (Lindley notes the determinant criterion was used by Wald). It is the historical bridge from Lindley’s measure to classical alphabetic optimality.
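The multivariate formula and its near-ignorance limit are easy to evaluate with log-determinants. The sketch below is an illustration with assumed matrices: a very diffuse prior $A = 10^{6} I$ and two hypothetical error covariances. It compares the exact difference in information with the determinant-only approximation $\frac{1}{2} (\log |C_{2}| - \log |C_{1}|)$.

```python
import numpy as np

def mvn_information_gain(A, C):
    """0.5 * log(|A + C| / |C|) for x ~ N(theta, C) with prior theta ~ N(mu, A)."""
    _, logdet_sum = np.linalg.slogdet(A + C)
    _, logdet_c = np.linalg.slogdet(C)
    return 0.5 * (logdet_sum - logdet_c)

# Near-ignorance: prior covariance much larger than either error covariance.
A = 1e6 * np.eye(2)
C1 = np.array([[1.0, 0.2],
               [0.2, 1.0]])     # determinant 0.96
C2 = np.array([[2.0, 0.0],
               [0.0, 1.0]])     # determinant 2.0

exact_difference = mvn_information_gain(A, C1) - mvn_information_gain(A, C2)
determinant_only = 0.5 * (np.linalg.slogdet(C2)[1] - np.linalg.slogdet(C1)[1])

# One-dimensional sanity check against 0.5 * log(1 + tau^2 / sigma^2).
scalar_gain = mvn_information_gain(np.array([[1.0]]), np.array([[4.0]]))

print(exact_difference, determinant_only)
```

`np.linalg.slogdet` is used instead of `np.linalg.det` so that the log-determinant of a very large prior covariance is computed without overflow.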

Sequential sampling = the Wald SPRT (§6, Eqs. 20–21)

Stopping when the posterior information first reaches a threshold $δ$ recovers classical sequential tests. For a binomial dichotomy $Θ = \{θ_{1}, θ_{2}\}$, the stopping rule “continue while $1 - A < p_{n}(θ_{1}) < A$” is exactly a Wald sequential probability ratio test of $θ_{1}$ vs $θ_{2}$:
$\frac{1 - A}{A} \frac{p(θ_{2})}{p(θ_{1})} < \frac{p(x_{1}, \dots, x_{n} ∣ θ_{1})}{p(x_{1}, \dots, x_{n} ∣ θ_{2})} < \frac{A}{1 - A} \frac{p(θ_{2})}{p(θ_{1})} .$
For the normal experiment, the optimum sequential scheme is of fixed sample size $n \geq (2 π e σ^{2} τ^{2} - σ^{2} e^{- 2 δ}) / (τ^{2} e^{- 2 δ})$ ; for repeated binomial trials with a Beta prior family $p_{ab} (θ) \propto θ^{a - 1} (1 - θ)^{b - 1}$ (Eq. 22), the stopping boundary is approximately $(a + b)^{3} = λab$ (Fig. 2) — formalizing the “common-sense” intuition that repeatedly observing the same outcome (e.g. “the sun rises”) accumulates more information than a mixture.
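Two of these claims can be checked with a few lines of arithmetic. The first part of the sketch evaluates the fixed sample size bound for the normal experiment and confirms that it is the smallest $n$ at which the posterior information $-\frac{1}{2} \log (2 π e v_{n})$, with $v_{n} = (1/τ^{2} + n/σ^{2})^{-1}$, reaches $δ$. The second part confirms on a grid of binomial outcomes that the posterior-probability band and the likelihood-ratio band give the same continue/stop decision. All parameter values and function names are assumptions chosen for illustration, and $A > 1/2$ is assumed.

```python
import math

def min_sample_size(sigma, tau, delta):
    """Smallest integer n satisfying the fixed-sample-size bound for the normal experiment."""
    shrink = math.exp(-2 * delta)
    bound = (2 * math.pi * math.e * sigma**2 * tau**2 - sigma**2 * shrink) / (tau**2 * shrink)
    return max(0, math.ceil(bound))

def posterior_information(n, sigma, tau):
    """Integral of p log p for the normal posterior after n observations."""
    variance = 1.0 / (1.0 / tau**2 + n / sigma**2)
    return -0.5 * math.log(2 * math.pi * math.e * variance)

def continue_by_posterior(k, n, theta1, theta2, prior1, A):
    """Continue while 1 - A < p_n(theta1) < A, after k successes in n trials."""
    l1 = theta1**k * (1 - theta1)**(n - k)
    l2 = theta2**k * (1 - theta2)**(n - k)
    p1 = prior1 * l1 / (prior1 * l1 + (1 - prior1) * l2)
    return (1 - A) < p1 < A

def continue_by_ratio(k, n, theta1, theta2, prior1, A):
    """The same rule written as bounds on the likelihood ratio (SPRT form)."""
    ratio = (theta1 / theta2)**k * ((1 - theta1) / (1 - theta2))**(n - k)
    prior_odds = (1 - prior1) / prior1          # p(theta2) / p(theta1)
    return (1 - A) / A * prior_odds < ratio < A / (1 - A) * prior_odds

# Normal experiment with assumed sigma, tau and threshold delta.
sigma, tau, delta = 2.0, 1.0, 0.0
n_min = min_sample_size(sigma, tau, delta)

# Binomial dichotomy with assumed theta values, prior and band A.
rules_agree = all(
    continue_by_posterior(k, n, 0.3, 0.6, 0.4, 0.95)
    == continue_by_ratio(k, n, 0.3, 0.6, 0.4, 0.95)
    for n in range(0, 21)
    for k in range(n + 1)
)

print(n_min, rules_agree)
```

The sample size is fixed in the normal case because the posterior variance, and hence the posterior information, does not depend on the observed values.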

Connections

Originates the EIG objective that every modern method (Foster 2019, Foster 2020, Rainforth 2023) estimates and optimizes. The three reviewed papers all cite this as the foundational reference.
Theorem 2 (additivity) is the ancestor of the additive incremental/total EIG in Sequential and Adaptive BED and deep adaptive design.
The determinant criterion prefigures the Fisher-information/alphabetic-optimality view contrasted in Information-Theoretic Design Objectives.
Builds on Shannon (1948) and the Kullback–Leibler (1951) information; relates to Blackwell’s Comparison of Experiments (1951–53).
