Lindley’s Information Measure

Summary

Lindley (1956), On a Measure of the Information Provided by an Experiment (Annals of Mathematical Statistics). The founding paper of Bayesian experimental design. Adapting Shannon’s information theory to statistics, Lindley defines the average information provided by an experiment as the expected gain in negative entropy (equivalently, the expected reduction in entropy) of the parameter from prior to posterior, which is exactly the modern expected information gain (see Expected Information Gain), i.e. the mutual information $I(\theta; x)$ between parameter and outcome. He proves it is non-negative (Thm 1) and additive (Thm 2), shows it is invariant under reduction to a sufficient statistic and concave with diminishing returns under repeated sampling (Thms 3–6), defines a prior-free partial order on experiments (more informative for all priors), relates it to Blackwell’s ordering (Thm 9), and applies it to design: perform the experiment of greatest expected information, and continue until a preassigned amount is attained. This recovers the Wald SPRT and a determinant (D-optimality) criterion.

Overview

Shannon introduced two ideas: that information is a statistical concept (defined relative to a frequency distribution), and that there is an essentially unique functional of that distribution measuring its information content. Lindley’s purpose is to apply these to an experiment rather than a message: replace the transmitted message by knowledge of parameters before an experiment, and the received message by knowledge after. Comparing the two quantifies the information the experiment provides.

Crucially, Lindley argues that prior distributions (“though usually anathema to the statistician”) are essential to the notion of experimental information: if the prior is concentrated on a single value of $\theta$ (the state of nature is known), no experiment can be informative. This makes the paper a foundational Bayesian document.

Main Content

The measure

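Lindley works with an experiment $\mathcal{E}$: outcome space $X$, parameter space $\Theta$, and densities $p(x \mid \theta)$. With prior $p(\theta)$, the marginal is $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$ (Eq. 1) and Bayes’ theorem gives $p(\theta \mid x) = p(x \mid \theta)\, p(\theta) / p(x)$ (Eq. 2).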

Definition: Information of a distribution (Lindley 1956, Eqs. 3–5)

The amount of information in the prior is $\mathcal{I}_0 = \int p(\theta) \log p(\theta)\, d\theta$, and after observing $x$, in the posterior, $\mathcal{I}_1(x) = \int p(\theta \mid x) \log p(\theta \mid x)\, d\theta$. Sign convention (important): Lindley uses $+\int p \log p$, the negative of Shannon entropy. The sign is deliberately reversed, so that a distribution concentrated on one value has maximum information and a spread distribution has less. This is the reverse of the communications engineer’s scale.

Definition 1 & 2: Information provided by an experiment (Lindley 1956, Eqs. 6–7)

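The information provided by experiment $\mathcal{E}$ with prior $p(\theta)$ when the observation is $x$ is the posterior information minus the prior information:

$$\mathcal{I}(\mathcal{E}, p(\theta), x) = \mathcal{I}_1(x) - \mathcal{I}_0 .$$

The average information provided by the experiment (Definition 2) averages over outcomes via the marginal $p(x)$:

$$\mathcal{I}(\mathcal{E}, p(\theta)) = \int p(x)\, \bigl[\mathcal{I}_1(x) - \mathcal{I}_0\bigr]\, dx .$$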

This is exactly the modern expected information gain — see Expected Information Gain.

Equivalent forms (Lindley 1956, Eqs. 8–11)

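The average information has several equivalent expressions, including the symmetric joint form

$$\mathcal{I}(\mathcal{E}, p(\theta)) = \iint p(x, \theta)\, \log \frac{p(x, \theta)}{p(x)\, p(\theta)}\, dx\, d\theta ,$$

which manifestly shows the symmetry between $x$ and $\theta$ (it is the mutual information $I(\theta; x)$) and its invariance under one-to-one reparameterization of $\theta$. Lindley notes this expression “occurs in Shannon’s theory for the rate of transmission along a channel.” (The single-outcome information $\mathcal{I}(\mathcal{E}, p(\theta), x)$ is not invariant; the average is.)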
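The equivalence of the two forms can be checked numerically. The following is a minimal sketch for a discrete experiment, with sums replacing the integrals; the function names and the two-state likelihood table are hypothetical illustrations, not taken from Lindley's paper.

```python
import math

def neg_entropy(dist):
    """Lindley's information of a discrete distribution: sum p log p (0 log 0 := 0)."""
    return sum(p * math.log(p) for p in dist if p > 0)

def marginal(prior, lik, x):
    """p(x) = sum over theta of p(x | theta) p(theta)."""
    return sum(p * row[x] for p, row in zip(prior, lik))

def posterior(prior, lik, x):
    """Bayes' theorem for the outcome with index x."""
    px = marginal(prior, lik, x)
    return [p * row[x] / px for p, row in zip(prior, lik)]

def average_info(prior, lik):
    """Definition 2: average over x of (posterior information - prior information)."""
    i0 = neg_entropy(prior)
    total = 0.0
    for x in range(len(lik[0])):
        px = marginal(prior, lik, x)
        if px > 0:
            total += px * (neg_entropy(posterior(prior, lik, x)) - i0)
    return total

def mutual_info(prior, lik):
    """Symmetric joint form: sum of p(x, theta) log[p(x, theta) / (p(x) p(theta))]."""
    total = 0.0
    for p, row in zip(prior, lik):
        for x, l in enumerate(row):
            if p > 0 and l > 0:
                # p(x, theta) / (p(x) p(theta)) simplifies to p(x | theta) / p(x)
                total += p * l * math.log(l / marginal(prior, lik, x))
    return total

# Hypothetical experiment: lik[theta][x] = p(x | theta), two states, two outcomes.
prior = [0.5, 0.5]
lik = [[0.9, 0.1], [0.2, 0.8]]
print(average_info(prior, lik), mutual_info(prior, lik))
```

The two printed values should agree up to floating-point rounding, since both compute the same quantity.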

The core theorems

Theorem 1 — Non-negativity (any experiment is informative on average)

$\mathcal{I}(\mathcal{E}, p(\theta)) \ge 0$, with equality iff $p(x \mid \theta)$ does not depend on $\theta$ (except on a null set). Proof: a convexity (Jensen) inequality on $u \log u$. Interpretation: provided the outcome density varies with $\theta$, the experiment is informative on average, though a particular surprising $x$ can reduce information ($\mathcal{I}(\mathcal{E}, p(\theta), x) < 0$ is possible).
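A small numerical illustration of the last point, as a sketch with hypothetical numbers: under a confident prior, the unexpected outcome leaves the posterior more spread out than the prior, so the single-outcome information is negative even though the average stays positive.

```python
import math

def neg_entropy(dist):
    """Information of a discrete distribution: sum p log p."""
    return sum(p * math.log(p) for p in dist if p > 0)

def info_given_x(prior, lik, x):
    """Return (I_1(x) - I_0, p(x)) for the single observed outcome x."""
    joint = [p * row[x] for p, row in zip(prior, lik)]
    px = sum(joint)
    post = [j / px for j in joint]
    return neg_entropy(post) - neg_entropy(prior), px

# Hypothetical setup: the prior strongly favours state 0; lik[theta][x] = p(x | theta).
prior = [0.9, 0.1]
lik = [[0.9, 0.1], [0.1, 0.9]]

gains = [info_given_x(prior, lik, x) for x in range(2)]
average = sum(px * gain for gain, px in gains)

for x, (gain, px) in enumerate(gains):
    print(f"x={x}: p(x)={px:.2f}, information gained={gain:+.3f}")
print(f"average information={average:+.3f}")
```

Here outcome `x=1` pushes the posterior to roughly (0.5, 0.5), which is less concentrated than the prior, so its gain is negative; outcome `x=0` is far more probable and has a positive gain.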

Theorem 2 — Additivity / chain rule (Lindley 1956, Eqs. 12–13)

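For an experiment $\mathcal{E} = (\mathcal{E}_1, \mathcal{E}_2)$ yielding $(x_1, x_2)$, $\mathcal{I}(\mathcal{E}) = \mathcal{I}(\mathcal{E}_1) + \mathcal{I}(\mathcal{E}_2 \mid \mathcal{E}_1)$, where $\mathcal{I}(\mathcal{E}_2 \mid \mathcal{E}_1)$ is the average information from $\mathcal{E}_2$ after $\mathcal{E}_1$ has been performed and observed (all terms taken with the same prior $p(\theta)$). This is the additivity postulate Shannon required, and underlies the additivity of incremental EIGs in sequential design. Corollary (sufficiency): if a statistic $t(x)$ is sufficient for $\theta$ (Neyman–Fisher), then the experiment that reports only $t$ provides the same average information as $\mathcal{E}$. No information is lost by reducing to a sufficient statistic; an insufficient statistic strictly loses information.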
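Additivity can be verified directly on a small discrete example. The sketch below assumes two experiments that are independent given $\theta$ (so the joint likelihood is a product); all names and numbers are hypothetical. The conditional term is computed as the average, over the first outcome, of the information from the second experiment under the intermediate posterior.

```python
import math

def neg_entropy(dist):
    return sum(p * math.log(p) for p in dist if p > 0)

def avg_info(prior, lik):
    """Average information: sum over x of p(x) [I_1(x) - I_0]."""
    i0 = neg_entropy(prior)
    total = 0.0
    for x in range(len(lik[0])):
        joint = [p * row[x] for p, row in zip(prior, lik)]
        px = sum(joint)
        if px > 0:
            total += px * (neg_entropy([j / px for j in joint]) - i0)
    return total

def joint_lik(lik1, lik2):
    """Likelihood of the pair (x1, x2), assuming independence given theta."""
    return [[a * b for a in r1 for b in r2] for r1, r2 in zip(lik1, lik2)]

def conditional_info(prior, lik1, lik2):
    """I(E2 | E1): average over x1 of the information from E2 under p(theta | x1)."""
    total = 0.0
    for x1 in range(len(lik1[0])):
        joint = [p * row[x1] for p, row in zip(prior, lik1)]
        px1 = sum(joint)
        if px1 > 0:
            total += px1 * avg_info([j / px1 for j in joint], lik2)
    return total

# Hypothetical two-state experiments; lik[theta][x] = p(x | theta).
prior = [0.3, 0.7]
lik1 = [[0.8, 0.2], [0.3, 0.7]]
lik2 = [[0.6, 0.4], [0.1, 0.9]]

whole = avg_info(prior, joint_lik(lik1, lik2))
first = avg_info(prior, lik1)
second_given_first = conditional_info(prior, lik1, lik2)
print(whole, first + second_given_first)
```

The two printed numbers should coincide up to rounding. In this independent case the conditional term also does not exceed `avg_info(prior, lik2)`, which is the diminishing-returns property stated in Theorem 3.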

Theorems 3–6 — Diminishing returns and concavity

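  • Thm 3: for experiments that are independent given $\theta$, $\mathcal{I}(\mathcal{E}_2 \mid \mathcal{E}_1) \le \mathcal{I}(\mathcal{E}_2)$: an independent repeat is less informative performed second than first (diminishing marginal utility of equidistributed observations). Hence $\mathcal{I}(\mathcal{E}_1, \mathcal{E}_2) \le \mathcal{I}(\mathcal{E}_1) + \mathcal{I}(\mathcal{E}_2)$ for independent experiments.
  • Thm 4: for $n$ independent identical repetitions $\mathcal{E}^{(n)}$, $\mathcal{I}(\mathcal{E}^{(n)})$ is a concave, increasing function of $n$.
  • Thm 5: $\mathcal{I}(\mathcal{E}, p(\theta))$ is a concave function of the prior $p(\theta)$.
  • Thm 6: for a mixture experiment (sample from a randomly chosen component experiment), $\mathcal{I}$ is convex in the outcome densities $p(x \mid \theta)$: it is better to take a fixed sample size than to “mix” sample sizes of the same average.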
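Theorem 4 can be illustrated with repeated Bernoulli trials. This is a sketch under stated assumptions: a hypothetical two-point prior on the success probability, and the success count used in place of the full sequence (it is a sufficient statistic, so no information is lost).

```python
import math

def neg_entropy(dist):
    return sum(p * math.log(p) for p in dist if p > 0)

def avg_info(prior, lik):
    """Average information: sum over x of p(x) [I_1(x) - I_0]."""
    i0 = neg_entropy(prior)
    total = 0.0
    for x in range(len(lik[0])):
        joint = [p * row[x] for p, row in zip(prior, lik)]
        px = sum(joint)
        if px > 0:
            total += px * (neg_entropy([j / px for j in joint]) - i0)
    return total

def binom_lik(thetas, n):
    """p(k | theta) for the number of successes k in n Bernoulli(theta) trials."""
    return [[math.comb(n, k) * t**k * (1 - t)**(n - k) for k in range(n + 1)]
            for t in thetas]

# Hypothetical two-point prior on the success probability.
thetas = [0.3, 0.6]
prior = [0.5, 0.5]

infos = [avg_info(prior, binom_lik(thetas, n)) for n in range(9)]
steps = [b - a for a, b in zip(infos, infos[1:])]

for n, value in enumerate(infos):
    print(n, round(value, 4))
```

The values rise with $n$ while the successive increments in `steps` shrink, which is the concave, increasing shape the theorem describes. With a two-point prior the information cannot exceed $\log 2$, so the curve must flatten.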

Comparing experiments without a prior (the partial order)

Definition 4 — "More informative (S)" (Lindley 1956, Eq. 17)

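$\mathcal{E}_1$ is more informative (S) than $\mathcal{E}_2$ if $\mathcal{I}(\mathcal{E}_1, p(\theta)) \ge \mathcal{I}(\mathcal{E}_2, p(\theta))$ for all priors $p(\theta)$, with strict inequality for some. This is only a partial order: there exist pairs of experiments that are not comparable under this criterion and can be ranked only with a specific prior.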

Theorem 9 — Relation to Blackwell's ordering

If $\mathcal{E}_1$ is sufficient for $\mathcal{E}_2$ in Blackwell’s sense (an experimenter with $\mathcal{E}_1$ can reproduce $\mathcal{E}_2$ by a random device), then $\mathcal{E}_1$ is not less informative (S): $\mathcal{I}(\mathcal{E}_1, p(\theta)) \ge \mathcal{I}(\mathcal{E}_2, p(\theta))$ for every prior. So Blackwell’s ordering implies Lindley’s, but the converse is false (demonstrated by the binomial-dichotomy example, Fig. 1: there are experiments more informative in Lindley’s sense that are not Blackwell-comparable). Lindley’s ordering therefore extends Blackwell’s, in that more pairs are comparable, which he views as a “satisfactory feature.”
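The direction "Blackwell implies Lindley" can be seen numerically. In the sketch below, the second experiment is built from the first by passing each outcome through a randomizing device (a row-stochastic matrix); the matrices are hypothetical, and the check over a grid of priors is an illustration, not a proof.

```python
import math

def neg_entropy(dist):
    return sum(p * math.log(p) for p in dist if p > 0)

def avg_info(prior, lik):
    """Average information: sum over x of p(x) [I_1(x) - I_0]."""
    i0 = neg_entropy(prior)
    total = 0.0
    for x in range(len(lik[0])):
        joint = [p * row[x] for p, row in zip(prior, lik)]
        px = sum(joint)
        if px > 0:
            total += px * (neg_entropy([j / px for j in joint]) - i0)
    return total

def garble(lik, device):
    """p2(y | theta) = sum over x of p1(x | theta) * device[x][y]."""
    n_y = len(device[0])
    return [[sum(row[x] * device[x][y] for x in range(len(row))) for y in range(n_y)]
            for row in lik]

# Hypothetical experiment E1 and randomizing device; each row sums to 1.
lik1 = [[0.9, 0.1], [0.2, 0.8]]
device = [[0.8, 0.2], [0.3, 0.7]]
lik2 = garble(lik1, device)  # E2 is reproducible from E1, so E1 is Blackwell-sufficient

priors = [[q / 20, 1 - q / 20] for q in range(1, 20)]
gaps = [avg_info(p, lik1) - avg_info(p, lik2) for p in priors]
print(min(gaps))
```

Every entry of `gaps` should be non-negative: randomizing the outcome cannot add information about $\theta$, whatever the prior.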

Lindley contrasts his information ordering with the decision-theoretic comparisons of Blackwell and of Bohnenblust–Shapley–Sherman (loss-function based), arguing that gaining knowledge about nature is a legitimate purpose of experimentation distinct from reaching decisions.

The design principle

The Lindley design rule (§2, §6)

“Perform that experiment for which the expected gain in information is the greatest, and continue experimentation until a preassigned amount of information has been attained.”

This is the seed of all of Bayesian optimal experimental design: maximize the EIG to choose designs, and use an information threshold as a stopping rule for sequential experiments.
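A minimal sketch of the rule as a greedy loop, under several assumptions that are not from the paper: a discrete parameter, a hypothetical menu of two candidate experiments, simulated outcomes from a fixed "true" state, and a cap on the number of steps. The stopping test reads "a preassigned amount of information" as the realized gain of the current posterior over the initial prior.

```python
import math
import random

def neg_entropy(dist):
    return sum(p * math.log(p) for p in dist if p > 0)

def update(belief, lik, x):
    """Posterior after observing outcome x."""
    joint = [p * row[x] for p, row in zip(belief, lik)]
    px = sum(joint)
    return [j / px for j in joint]

def avg_info(belief, lik):
    """Expected information gain of one run of the experiment under `belief`."""
    i0 = neg_entropy(belief)
    total = 0.0
    for x in range(len(lik[0])):
        px = sum(p * row[x] for p, row in zip(belief, lik))
        if px > 0:
            total += px * (neg_entropy(update(belief, lik, x)) - i0)
    return total

def run_design(prior, experiments, true_theta, target, max_steps=50, seed=0):
    """Greedy loop: pick the max-EIG experiment, observe, stop once `target` is reached."""
    rng = random.Random(seed)
    belief, i0, history = list(prior), neg_entropy(prior), []
    while neg_entropy(belief) - i0 < target and len(history) < max_steps:
        name = max(experiments, key=lambda k: avg_info(belief, experiments[k]))
        row = experiments[name][true_theta]
        x = rng.choices(range(len(row)), weights=row)[0]  # simulated outcome
        belief = update(belief, experiments[name], x)
        history.append((name, x))
    return belief, history

# Hypothetical menu of experiments; lik[theta][x] = p(x | theta).
experiments = {
    "sharp": [[0.9, 0.1], [0.1, 0.9]],
    "blunt": [[0.6, 0.4], [0.4, 0.6]],
}
prior = [0.5, 0.5]
target = 0.5  # nats of information to attain
belief, history = run_design(prior, experiments, true_theta=0, target=target)
print(history, belief)
```

Because realized information can fall after a surprising outcome, the loop checks the threshold after every observation rather than counting expected gains; the step cap is only a safeguard for the simulation.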

Examples

Normal experiment — smaller variance is more informative (§6, Eq. 19)

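For experiments $\mathcal{E}_i$ with $x \sim N(\theta, \sigma_i^2)$, $i = 1, 2$, Lindley shows that $\mathcal{E}_1$ is not less informative than $\mathcal{E}_2$ when $\sigma_1 \le \sigma_2$ (the smaller-variance experiment is more informative for all priors), via the stochastic transformation $x_2 = x_1 + \varepsilon$ with independent noise $\varepsilon \sim N(0, \sigma_2^2 - \sigma_1^2)$. With a normal prior $\theta \sim N(\mu_0, \sigma_0^2)$ and $n$ independent observations of variance $\sigma^2$:

$$\mathcal{I} = \tfrac{1}{2} \log\left(1 + \frac{n \sigma_0^2}{\sigma^2}\right).$$

The $n$-sample formula illustrates Theorem 4 (concave, increasing in $n$) and, notably, grows without limit, though only logarithmically; contrast with the usual precision measure $n/\sigma^2$, which is linear in $n$.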
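The normal-prior result follows in a few lines from the definitions. The derivation below is a sketch using symbols chosen for this note (observation variance $\sigma^2$, prior variance $\sigma_0^2$, posterior variance $s_n^2$), which may differ from Lindley's own notation. It uses the standard fact that the information $\int p \log p$ of a normal density depends only on its variance.

```latex
\begin{align*}
  % Information (negative entropy) of a normal density with variance s^2
  \int N(\theta; m, s^2) \log N(\theta; m, s^2)\, d\theta
    &= -\tfrac{1}{2} \log\!\left(2 \pi e\, s^2\right), \\
  % Posterior variance after n observations: precisions add
  \frac{1}{s_n^2} &= \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \\
  % The posterior variance does not depend on the data, so the
  % single-outcome and average information coincide
  \mathcal{I}
    &= -\tfrac{1}{2} \log\!\left(2 \pi e\, s_n^2\right)
       + \tfrac{1}{2} \log\!\left(2 \pi e\, \sigma_0^2\right)
     = \tfrac{1}{2} \log \frac{\sigma_0^2}{s_n^2}
     = \tfrac{1}{2} \log\!\left(1 + \frac{n \sigma_0^2}{\sigma^2}\right).
\end{align*}
```

Because the logarithm is concave and increasing in $n$, each additional observation adds less than the one before, yet the total has no upper bound.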

Multivariate normal — the determinant (D-optimality) criterion (§6)

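For $k$-dimensional $x \sim N(\theta, \Sigma)$ with prior $\theta \sim N(\mu_0, \Sigma_0)$:

$$\mathcal{I} = \tfrac{1}{2} \log \frac{|\Sigma_0 + \Sigma|}{|\Sigma|} .$$

Under near-ignorance ($\Sigma_0^{-1}$ small, i.e. a diffuse prior), $\mathcal{I} \approx \tfrac{1}{2} \log |\Sigma_0| - \tfrac{1}{2} \log |\Sigma|$, so comparison of experiments reduces approximately to comparing $|\Sigma|$, i.e. design via the determinant of the dispersion matrix, the criterion later known as D-optimality (Lindley notes the determinant criterion was used by Wald). This is the historical bridge from Lindley’s measure to classical alphabetic optimality.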
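The "approximately" matters: the determinant ranking holds for diffuse priors but need not hold for tight ones. The sketch below uses the standard Gaussian mutual-information formula $\tfrac{1}{2}\log(|\Sigma_0 + \Sigma| / |\Sigma|)$ with hypothetical $2 \times 2$ covariance matrices, and hand-rolled $2 \times 2$ helpers to stay within the standard library.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add2(a, b):
    """Entrywise sum of two 2x2 matrices."""
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def normal_info(sigma, sigma0):
    """Average information for x ~ N(theta, sigma) with prior covariance sigma0."""
    return 0.5 * math.log(det2(add2(sigma0, sigma)) / det2(sigma))

def isotropic(v):
    """Prior covariance v * I."""
    return [[v, 0.0], [0.0, v]]

# Hypothetical designs: A is very precise in one direction only,
# B is moderately precise in both and has the smaller determinant.
sigma_a = [[0.1, 0.0], [0.0, 10.0]]   # det = 1.0
sigma_b = [[0.95, 0.0], [0.0, 0.95]]  # det = 0.9025

diffuse, tight = isotropic(1e4), isotropic(1e-2)
print("diffuse prior:", normal_info(sigma_a, diffuse), normal_info(sigma_b, diffuse))
print("tight prior:  ", normal_info(sigma_a, tight), normal_info(sigma_b, tight))
```

Under the diffuse prior the design with the smaller determinant (B) is the more informative, matching the determinant criterion; under the tight prior the ranking reverses in this example, which is why the reduction to $|\Sigma|$ is tied to near-ignorance.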

Sequential sampling = the Wald SPRT (§6, Eqs. 20–21)

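Stopping when the posterior information first reaches a threshold recovers classical sequential tests. For a binomial dichotomy $\theta \in \{\theta_1, \theta_2\}$, the posterior information depends on the data only through the posterior probability of $\theta_1$, so the stopping rule (continue while the posterior information $\mathcal{I}_1(x_1, \dots, x_n)$ is below the threshold) is exactly a Wald sequential probability ratio test of $\theta_1$ vs $\theta_2$. Sampling continues while

$$B < \prod_{i=1}^{n} \frac{p(x_i \mid \theta_1)}{p(x_i \mid \theta_2)} < A ,$$

with constants $A$ and $B$ fixed by the threshold and the prior.

For the normal experiment, the optimum sequential scheme is of fixed sample size (under a normal prior the posterior variance, and hence the posterior information, depends only on $n$); for repeated binomial trials with a Beta prior family (Eq. 22), the approximate stopping boundary is plotted in Lindley’s Fig. 2, formalizing the “common-sense” intuition that repeatedly observing the same outcome (e.g. “the sun rises”) accumulates more information than a mixture.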
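The equivalence for the dichotomy can be checked by running both stopping rules on the same data. This is a sketch with hypothetical parameter values: the information threshold is converted to a posterior-probability boundary by bisection, and that boundary is then converted to likelihood-ratio limits via posterior odds = prior odds × likelihood ratio.

```python
import math
import random

def info(pi):
    """Posterior information when theta_1 has probability pi (two-point parameter)."""
    return sum(p * math.log(p) for p in (pi, 1.0 - pi) if p > 0)

def boundary(c):
    """Solve info(a) = c for a in (0, 1/2); info is decreasing on that interval."""
    lo, hi = 1e-12, 0.5
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if info(mid) > c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def stop_by_information(data, prior1, th1, th2, c):
    """First n at which the posterior information reaches c (None if never)."""
    pi = prior1
    for n, x in enumerate(data, 1):
        l1, l2 = (th1, th2) if x else (1 - th1, 1 - th2)
        pi = pi * l1 / (pi * l1 + (1 - pi) * l2)
        if info(pi) >= c:
            return n
    return None

def stop_by_sprt(data, prior1, th1, th2, c):
    """First n at which the likelihood ratio leaves (lower, upper) (None if never)."""
    a = boundary(c)
    odds0 = prior1 / (1 - prior1)
    lower, upper = (a / (1 - a)) / odds0, ((1 - a) / a) / odds0
    ratio = 1.0
    for n, x in enumerate(data, 1):
        ratio *= (th1 / th2) if x else ((1 - th1) / (1 - th2))
        if not (lower < ratio < upper):
            return n
    return None

# Hypothetical dichotomy, prior, and threshold (in nats; must lie in (-log 2, 0)).
TH1, TH2, PRIOR1, C = 0.7, 0.4, 0.5, -0.2
rng = random.Random(1)
data = [rng.random() < TH1 for _ in range(500)]  # Bernoulli draws under theta_1
n_info = stop_by_information(data, PRIOR1, TH1, TH2, C)
n_sprt = stop_by_sprt(data, PRIOR1, TH1, TH2, C)
print(n_info, n_sprt)
```

Both rules should stop at the same observation, since the information threshold and the likelihood-ratio limits describe the same region of posterior probabilities.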

Connections

See Also