Adaptive Contrastive Estimation (ACE)

Summary

Adaptive Contrastive Estimation (ACE) is the EIG lower bound recommended by Foster 2020. It augments the Barber–Agakov bound with $L$ contrastive samples $\theta_{1:L} \sim q_\phi(\theta \mid y)$ in the denominator, alongside the original sample $\theta_0$ from which $y$ was drawn. The resulting bound is tight in two complementary regimes, when the inference network $q_\phi$ is good or when $L \to \infty$, and is optimized jointly over the design $\xi$ and the parameters $\phi$ by stochastic gradient ascent.

Overview

The Barber–Agakov bound $I_{\text{BA}}$ is tight only if the inference network $q_\phi(\theta \mid y)$ can represent the true posterior $p(\theta \mid y, \xi)$. When it cannot, $I_{\text{BA}}$ is loose. ACE fixes this adaptively: it borrows VNMC’s idea of contrastive/importance samples but arranges them so that the result remains a valid lower bound (VNMC is an upper bound, so it cannot be maximized jointly over $(\xi, \phi)$). Including the original sample $\theta_0$ in the denominator prevents the catastrophic under-estimation of the marginal $p(y \mid \xi)$ that purely contrastive samples would cause.

Main Content

Definition: ACE lower bound (Foster 2020, Eq. 11)

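With $\theta_0 \sim p(\theta)$, $y \sim p(y \mid \theta_0, \xi)$, and contrastive samples $\theta_{1:L} \sim q_\phi(\theta \mid y)$:

$$
I_{\text{ACE}}(\xi, \phi, L) = \mathbb{E}\left[\log \frac{p(y \mid \theta_0, \xi)}{\frac{1}{L+1}\sum_{\ell=0}^{L} \frac{p(\theta_\ell)\, p(y \mid \theta_\ell, \xi)}{q_\phi(\theta_\ell \mid y)}}\right],
$$

with the expectation over $p(\theta_0)\, p(y \mid \theta_0, \xi) \prod_{\ell=1}^{L} q_\phi(\theta_\ell \mid y)$. The denominator is an importance-sampling estimate of the marginal $p(y \mid \xi)$ built from the $L$ contrasts plus $\theta_0$.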
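A minimal Monte Carlo sketch of this estimator, under stated assumptions: the toy linear-Gaussian model, the Gaussian form of the inference network, and the names `ace_estimate`, `q_slope`, and `q_std` are all hypothetical and do not come from Foster 2020. The toy model is chosen only because its posterior and EIG are known in closed form, which makes the bound easy to sanity-check.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def ace_estimate(xi, q_slope, q_std, L, num_outer=5000, seed=0):
    """Monte Carlo estimate of the ACE bound for the assumed toy model

        theta ~ N(0, 1),   y | theta, xi ~ N(xi * theta, 1),

    with an assumed inference network q(theta | y) = N(q_slope * y, q_std^2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_outer):
        theta0 = rng.gauss(0.0, 1.0)           # original sample from the prior
        y = rng.gauss(xi * theta0, 1.0)        # outcome simulated from theta0
        contrasts = [rng.gauss(q_slope * y, q_std) for _ in range(L)]
        # Log importance weights p(theta) p(y | theta, xi) / q(theta | y),
        # evaluated at theta0 and at each contrastive sample.
        log_weights = [
            log_normal_pdf(t, 0.0, 1.0)
            + log_normal_pdf(y, xi * t, 1.0)
            - log_normal_pdf(t, q_slope * y, q_std)
            for t in [theta0] + contrasts
        ]
        # log p(y | theta0, xi) minus log of the (L + 1)-term average weight.
        total += log_normal_pdf(y, xi * theta0, 1.0) - log_mean_exp(log_weights)
    return total / num_outer


xi = 2.0
true_eig = 0.5 * math.log(1.0 + xi * xi)  # closed form for this toy model

# Inference network equal to the exact posterior N(xi*y/(1+xi^2), 1/(1+xi^2)).
exact = ace_estimate(xi, q_slope=xi / (1.0 + xi * xi),
                     q_std=(1.0 + xi * xi) ** -0.5, L=3)

# Deliberately poor network (the prior), with a growing number of contrasts.
prior_as_q = [ace_estimate(xi, q_slope=0.0, q_std=1.0, L=L) for L in (0, 5, 50)]

print(true_eig, exact, prior_as_q)
```

With the exact posterior as the network, the estimate should sit near the closed-form EIG even for small $L$; with the prior as the network, it starts at zero for $L = 0$ and should move toward the EIG as $L$ grows, up to Monte Carlo noise.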

Theorem 1 — Properties of ACE (Foster 2020)

For any model $p(\theta)\, p(y \mid \theta, \xi)$ and inference network $q_\phi(\theta \mid y)$:

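  1. Lower bound with KL error: $I(\xi) - I_{\text{ACE}}(\xi, \phi, L) = \mathbb{E}_{p(y \mid \xi)}\big[\mathrm{KL}\big(\tilde{P}(\theta_{0:L} \mid y) \,\|\, \prod_{\ell=0}^{L} q_\phi(\theta_\ell \mid y)\big)\big] \ge 0$, where $\tilde{P}(\theta_{0:L} \mid y) = \frac{1}{L+1} \sum_{\ell=0}^{L} p(\theta_\ell \mid y, \xi) \prod_{k \ne \ell} q_\phi(\theta_k \mid y)$.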
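  2. Asymptotic exactness: $I_{\text{ACE}}(\xi, \phi, L) \to I(\xi)$ as $L \to \infty$.
  3. Monotone in $L$: $I_{\text{ACE}}(\xi, \phi, L_2) \ge I_{\text{ACE}}(\xi, \phi, L_1)$ for $L_2 \ge L_1 \ge 0$.
  4. Exact with perfect network: if $q_\phi(\theta \mid y) = p(\theta \mid y, \xi)$ then $I_{\text{ACE}}(\xi, \phi, L) = I(\xi)$ for all $L \ge 0$.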
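Property 4 follows directly from Bayes' rule. As a short worked step, write each term of the ACE denominator as the importance weight $p(\theta_\ell)\, p(y \mid \theta_\ell, \xi) / q_\phi(\theta_\ell \mid y)$, with $\ell = 0$ the original sample and $\ell = 1, \dots, L$ the contrasts:

```latex
\begin{align*}
q_\phi(\theta \mid y) = p(\theta \mid y, \xi)
\;&\Longrightarrow\;
\frac{p(\theta_\ell)\, p(y \mid \theta_\ell, \xi)}{q_\phi(\theta_\ell \mid y)} = p(y \mid \xi)
\quad \text{for every } \ell, \\
&\Longrightarrow\;
I_{\text{ACE}}(\xi, \phi, L)
= \mathbb{E}_{p(\theta_0)\, p(y \mid \theta_0, \xi)}\!\left[\log \frac{p(y \mid \theta_0, \xi)}{p(y \mid \xi)}\right]
= I(\xi).
\end{align*}
```

Every weight equals the marginal, so their average does too, regardless of how many contrasts are drawn.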

Why ACE beats BA

$I_{\text{BA}}$ (Foster 2020, Eq. 7) is recovered as the $L = 0$ case. ACE adds a second route to tightness: even a poor $q_\phi$ gives an accurate bound if $L$ is large (property 2). This is the same adaptive-tightening logic as VNMC, but oriented to give a lower bound suitable for joint maximization. Foster 2020 reports that across all five experiments ACE generally does at least as well as the better of BA and PCE, hence the recommendation to use it as the default.
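To make the reduction to BA concrete, set $L = 0$ so that the denominator contains only the $\theta_0$ term. The sketch below assumes the ACE denominator is the average of the importance weights $p(\theta_\ell)\, p(y \mid \theta_\ell, \xi) / q_\phi(\theta_\ell \mid y)$ and writes $H[p(\theta)]$ for the prior entropy:

```latex
\begin{align*}
I_{\text{ACE}}(\xi, \phi, 0)
&= \mathbb{E}_{p(\theta_0)\, p(y \mid \theta_0, \xi)}\!\left[
   \log \frac{p(y \mid \theta_0, \xi)\, q_\phi(\theta_0 \mid y)}
             {p(\theta_0)\, p(y \mid \theta_0, \xi)}\right] \\
&= \mathbb{E}_{p(\theta_0)\, p(y \mid \theta_0, \xi)}\!\left[\log q_\phi(\theta_0 \mid y)\right]
   + H[p(\theta)]
 = I_{\text{BA}}(\xi, \phi).
\end{align*}
```

The likelihood cancels, leaving the expected log-density of the inference network plus the prior entropy, which is the Barber–Agakov form.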

Connection to InfoNCE

ACE generalizes the InfoNCE mutual-information bound of representation learning: its denominator, an average over $\theta_0$ and the $L$ contrasts, has the same form as the InfoNCE contrastive normalizer. Prior Contrastive Estimation (PCE) is the special case that uses the prior $p(\theta)$ as the contrastive distribution.

Examples

Death process trajectory (Foster 2020 §4.2, Figs. 1–2)

On the 2-D death-process design (measure infected counts at two times), ACE’s SGA trajectory climbs the known EIG surface to the optimum, reaching a higher final EIG than BA (0.9822), PCE (0.9822), and Bayesian optimization + NMC (0.9732), and converging faster in wall-clock time.
