Prior Contrastive Estimation (PCE)

Summary

Prior Contrastive Estimation (PCE) is the simplification of Adaptive Contrastive Estimation (ACE) that draws the contrastive samples from the prior $p (θ)$ instead of a learned inference network $q_{ϕ}$. This removes the need to learn any variational parameters (there is no $ϕ$), so it is cheaper and simpler to train, at the cost of being tight only as $L \to \infty$ (it loses ACE’s “good network” route to tightness). PCE is essentially the InfoNCE mutual-information bound applied to experimental design.

Overview

ACE’s inference network $q_{ϕ} (θ ∣ y)$ must be learned, adding parameters and training cost. If the prior is already a reasonable proposal for estimating the marginal $p (y ∣ ξ)$, we can skip the network and use prior draws as contrasts. The bound then depends only on the design $ξ$, so optimization is a pure design problem.

Main Content

Definition: PCE lower bound (Foster 2020, Eq. 12)

With $θ_{0} \sim p (θ)$, $y \sim p (y ∣ θ_{0}, ξ)$, and contrastive samples $θ_{1 : L} \sim p (θ)$ drawn from the prior:
$I_{PCE} (ξ, L) := \mathbb{E} \left[ \log \frac{p (y ∣ θ_{0}, ξ)}{\frac{1}{L + 1} \sum_{ℓ = 0}^{L} p (y ∣ θ_{ℓ}, ξ)} \right],$
where the expectation is over $p (θ_{0}) p (y ∣ θ_{0}, ξ) p (θ_{1 : L})$. This is the $q_{ϕ} = p (θ)$ special case of ACE, so it inherits Theorem 1: it is a valid lower bound, non-decreasing in $L$, and tight as $L \to \infty$ (but only case-2 tightness; there is no “perfect network” route).
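The definition above can be turned into a Monte Carlo estimator directly. The following is a minimal NumPy sketch, not the paper's implementation: it assumes a toy linear-Gaussian model ($θ \sim N (0, 1)$, $y ∣ θ, ξ \sim N (ξ θ, σ^{2})$) chosen only because its mutual information is known in closed form, and the names `gaussian_log_lik`, `log_mean_exp`, and `pce_estimate` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def gaussian_log_lik(y, theta, xi, sigma):
    """log N(y; xi * theta, sigma^2), broadcasting over y and theta."""
    resid = y - xi * theta
    return -0.5 * (resid / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)


def log_mean_exp(a, axis):
    """Numerically stable log(mean(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))


def pce_estimate(xi, L, n_outer=2000, sigma=1.0, seed=0):
    """Monte Carlo estimate of I_PCE(xi, L) for the assumed toy model."""
    rng = np.random.default_rng(seed)
    theta0 = rng.standard_normal(n_outer)                    # theta_0 ~ p(theta)
    y = xi * theta0 + sigma * rng.standard_normal(n_outer)   # y ~ p(y | theta_0, xi)
    contrasts = rng.standard_normal((n_outer, L))            # theta_{1:L} ~ p(theta)
    # Column 0 holds theta_0, so the denominator averages over l = 0..L.
    thetas = np.concatenate([theta0[:, None], contrasts], axis=1)
    log_lik = gaussian_log_lik(y[:, None], thetas, xi, sigma)  # shape (n_outer, L + 1)
    log_denominator = log_mean_exp(log_lik, axis=1)
    return float(np.mean(log_lik[:, 0] - log_denominator))


if __name__ == "__main__":
    xi = 2.0
    exact = 0.5 * np.log1p(xi ** 2)  # closed-form mutual information for sigma = 1
    for L in (1, 10, 100):
        print(L, pce_estimate(xi, L), exact)
```

Note that $θ_{0}$ is included in the denominator average; dropping it would change the estimator and the lower-bound guarantee would no longer follow from the definition above.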

Connection to InfoNCE (Foster 2020, Eq. 13)

The InfoNCE / information-noise-contrastive-estimation bound from representation learning (van den Oord et al. 2018) is, for data $x_{k}$, representations $z_{k}$, and critic $f_{ψ} (x, z) \geq 0$:
$MI (x; z) \geq \mathbb{E} \left[ \frac{1}{K} \sum_{k = 1}^{K} \log \frac{f_{ψ} (x_{k}, z_{k})}{\frac{1}{K} \sum_{ℓ = 1}^{K} f_{ψ} (x_{ℓ}, z_{k})} \right].$
Writing $θ$ for $x$ and $y$ for $z$, PCE is the case where the optimal critic $p (z ∣ x) = p (y ∣ θ, ξ)$ is known (it is the likelihood), so PCE is the experimental-design instance of InfoNCE with a known critic.
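As a worked check of this correspondence (a sketch using only the two definitions above), take $K = L + 1$ pairs $(θ_{k}, y_{k})$ with $θ_{k} \sim p (θ)$ and $y_{k} \sim p (y ∣ θ_{k}, ξ)$, and plug the likelihood in as the critic:
$\frac{1}{K} \sum_{k = 1}^{K} \log \frac{p (y_{k} ∣ θ_{k}, ξ)}{\frac{1}{K} \sum_{ℓ = 1}^{K} p (y_{k} ∣ θ_{ℓ}, ξ)}.$
In the $k$-th term, $θ_{k}$ plays the role of $θ_{0}$ and the remaining $K - 1 = L$ samples are independent prior draws, so each term has the same expectation as the PCE integrand and the expectation of the average is $I_{PCE} (ξ, L)$.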

Unnormalized prior densities

A practical bonus: PCE (and ACE) only need the prior up to proportionality. If $p (θ) = A \cdot γ (θ)$ with $A$ independent of $(ξ, ϕ, y)$ and $γ$ an unnormalized density, then (Foster 2020, Eq. 15)

$I (ξ) \geq \mathbb{E} \left[ \log \frac{p (y ∣ θ_{0}, ξ)}{\frac{1}{L + 1} \sum_{ℓ = 0}^{L} \frac{γ (θ_{ℓ}) p (y ∣ θ_{ℓ}, ξ)}{q_{ϕ} (θ_{ℓ} ∣ y)}} \right] - \log A,$

and the derivatives of $\log A$ with respect to $ξ$ and $ϕ$ vanish, so gradient-based optimization of the bound is unaffected by the unknown constant. This matters in iterated design, where the prior at step $t$ is the previous posterior $p (θ ∣ y_{1 : t - 1}, ξ_{1 : t - 1})$, known only up to its normalizing constant.
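To see where the $- \log A$ term comes from, here is the one-line substitution, assuming the ACE bound has the importance-weighted denominator $\frac{1}{L + 1} \sum_{ℓ = 0}^{L} \frac{p (θ_{ℓ}) p (y ∣ θ_{ℓ}, ξ)}{q_{ϕ} (θ_{ℓ} ∣ y)}$ (the form implied by the bound above). Replacing $p (θ_{ℓ})$ with $A \cdot γ (θ_{ℓ})$ gives
$\log \left[ \frac{1}{L + 1} \sum_{ℓ = 0}^{L} \frac{A \, γ (θ_{ℓ}) p (y ∣ θ_{ℓ}, ξ)}{q_{ϕ} (θ_{ℓ} ∣ y)} \right] = \log A + \log \left[ \frac{1}{L + 1} \sum_{ℓ = 0}^{L} \frac{γ (θ_{ℓ}) p (y ∣ θ_{ℓ}, ξ)}{q_{ϕ} (θ_{ℓ} ∣ y)} \right].$
The constant therefore enters the bound only as an additive $- \log A$, and $\nabla_{ξ} \log A = \nabla_{ϕ} \log A = 0$.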

When to prefer PCE vs ACE

PCE: the prior is an adequate proposal for $p (y ∣ ξ)$; no variational training is wanted; the dimension is low to moderate. PCE performed well in low dimensions but degraded as dimension increased (the prior becomes an inefficient proposal, so the bound under-estimates the information gain).
ACE: the inference network can closely approximate the posterior, or sampling from the prior is inefficient (high dimension). ACE and the Barber-Agakov (BA) bound learn adaptive proposals and avoid this under-estimation.
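One way to see the degradation concretely: because the numerator $p (y ∣ θ_{0}, ξ)$ is itself one of the $L + 1$ terms averaged in the denominator, every PCE sample is at most $\log (L + 1)$, so the estimate cannot track an information gain larger than that. The sketch below illustrates this under an assumed toy model ($θ \sim N (0, I_{d})$, $y ∣ θ \sim N (ξ θ, σ^{2} I_{d})$), whose exact mutual information $\frac{d}{2} \log (1 + ξ^{2} / σ^{2})$ grows linearly in $d$; the function names are hypothetical and the model is not taken from the source.

```python
import numpy as np


def pce_isotropic_gaussian(dim, L, xi=2.0, sigma=1.0, n_outer=500, seed=0):
    """PCE estimate for the assumed model theta ~ N(0, I), y ~ N(xi * theta, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    theta0 = rng.standard_normal((n_outer, 1, dim))
    y = xi * theta0 + sigma * rng.standard_normal((n_outer, 1, dim))
    contrasts = rng.standard_normal((n_outer, L, dim))
    thetas = np.concatenate([theta0, contrasts], axis=1)  # (n_outer, L + 1, dim)
    # Log-likelihood up to an additive constant, which cancels in the ratio.
    log_lik = -0.5 * np.sum(((y - xi * thetas) / sigma) ** 2, axis=2)
    m = log_lik.max(axis=1, keepdims=True)
    log_denom = m[:, 0] + np.log(np.mean(np.exp(log_lik - m), axis=1))
    return float(np.mean(log_lik[:, 0] - log_denom))


def true_eig(dim, xi=2.0, sigma=1.0):
    """Closed-form mutual information for the assumed model."""
    return float(0.5 * dim * np.log1p((xi / sigma) ** 2))


if __name__ == "__main__":
    L = 100
    cap = float(np.log(L + 1))
    for dim in (1, 5, 20):
        print(dim, pce_isotropic_gaussian(dim, L), true_eig(dim), cap)
```

In this toy setting the gap between the estimate and the exact value widens with `dim` at fixed `L`, which is the regime where a learned proposal is worth its training cost.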

Connections

Special case of Adaptive Contrastive Estimation (ACE) (set $q_{ϕ} = p (θ)$).
InfoNCE / NCE lineage: ties BOED to contrastive representation learning and noise-contrastive estimation.
In Rainforth 2023, PCE is one of the “contrastive bounds” cited as enabling consistent stochastic-gradient design optimization (their §3.3.1).
