NOTEARS Algorithm

Summary

Given the smooth acyclicity function $h$ (see Smooth Characterization of Acyclicity), learning a DAG becomes the equality-constrained program (ECP) $\min_{W} F(W) \ \text{s.t.} \ h(W) = 0$, solved by the augmented Lagrangian method. The algorithm has three pieces: (i) convert the constrained problem into a sequence of unconstrained subproblems via a quadratic penalty plus dual ascent; (ii) solve each subproblem with L-BFGS or proximal quasi-Newton; (iii) threshold the solution to round small weights to zero. Typically fewer than 10 augmented-Lagrangian iterations are needed.

Overview

Theorem 1 turns the combinatorial DAG constraint into a smooth equality constraint. NOTEARS then treats DAG learning as a classical equality-constrained optimization problem and solves it to stationarity (not global optimality — the program is nonconvex). The whole method is deliberately built from off-the-shelf solver components.

Main Content

The equality-constrained program (ECP)

Program (9): ECP (NOTEARS §4)

$\text{(ECP)} \qquad \min_{W \in \mathbb{R}^{d \times d}} F(W) \quad \text{subject to} \quad h(W) = 0.$
Equivalent to program (3). Its advantage: it is amenable to classical constrained-optimization techniques. But $\{W : h(W) = 0\}$ is a nonconvex set, so (9) inherits the difficulties of nonconvex optimization, and NOTEARS settles for stationary points of (9).

Augmented Lagrangian

Augmented Lagrangian (NOTEARS §4.1, Eqs. 10-12)

The augmented Lagrangian augments the objective with a quadratic penalty $\frac{ρ}{2} ∣ h (W) ∣^{2}$ ( $ρ > 0$ ) plus the Lagrange-multiplier term:
$L^{ρ} (W, α) = F (W) + \frac{ρ}{2} ∣ h (W) ∣^{2} + α h (W) .$
The dual function and dual problem are
$D(α) = \min_{W \in \mathbb{R}^{d \times d}} L^{ρ}(W, α), \qquad \max_{α \in \mathbb{R}} D(α).$
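As a minimal sketch of how $L^{ρ}$ can be evaluated, the snippet below assumes $h(W) = \operatorname{tr}(e^{W \circ W}) - d$ (consistent with the gradient quoted under Connections) and takes $F$ to be a least-squares loss. The choice of loss and all function names are assumptions of this example, not something fixed by the text above.

```python
import numpy as np
from scipy.linalg import expm


def h(W):
    """Acyclicity function, assumed here to be tr(exp(W ∘ W)) - d."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d


def least_squares_loss(W, X):
    """Assumed score F(W) = (1 / 2n) * ||X - X W||_F^2 (illustrative choice)."""
    n = X.shape[0]
    R = X - X @ W
    return 0.5 / n * np.sum(R ** 2)


def augmented_lagrangian(W, X, rho, alpha):
    """L^rho(W, alpha) = F(W) + (rho / 2) * h(W)^2 + alpha * h(W)."""
    hw = h(W)
    return least_squares_loss(W, X) + 0.5 * rho * hw ** 2 + alpha * hw
```

For an acyclic $W$ both extra terms vanish and the value reduces to $F(W)$; for a cyclic $W$ the penalty grows with $ρ$.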

Why augmented Lagrangian (vs. plain quadratic penalty)

A key property ([Nemirovski, 1999]): the augmented Lagrangian approximates the constrained solution well without driving the penalty $ρ \to \infty$ (which would make subproblems ill-conditioned). It is essentially a dual ascent scheme for the penalized problem.

Dual ascent update

Let $W_{α}^{⋆} = \arg\min_{W} L^{ρ}(W, α)$ be the local minimizer at $α$, so $D(α) = L^{ρ}(W_{α}^{⋆}, α)$. Since $L^{ρ}(W, α)$ is linear in $α$ for fixed $W$, the derivative of the dual is simply $\nabla D(α) = h(W_{α}^{⋆})$. Dual gradient ascent with step size $ρ$ then gives:

Dual update (NOTEARS §4.1, Eq. 14)

$α \leftarrow α + ρ h (W_{α}^{⋆}) .$

Proposition 3: Convergence rate (NOTEARS §4.1; Cor. 11.2.1, Nemirovski 1999)

For $ρ$ large enough and starting point $α_{0}$ near the solution $α^{⋆}$ , the update (14) converges to $α^{⋆}$ linearly. In experiments, typically fewer than 10 augmented-Lagrangian steps are required.
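To make the linear rate concrete, here is a toy sketch on a one-dimensional problem that is not the NOTEARS program: minimize $(w - 2)^{2}$ subject to $h(w) = w - 1 = 0$. The problem, the function name and the default values are all assumptions chosen so that the subproblem has a closed-form minimizer. For this toy the optimal multiplier is $α^{⋆} = 2$, and each dual update shrinks the error $|α - α^{⋆}|$ by the factor $2 / (2 + ρ)$ while $ρ$ stays fixed and finite.

```python
def dual_ascent_toy(rho=10.0, alpha=0.0, steps=10):
    """Method of multipliers on: min (w - 2)^2  s.t.  h(w) = w - 1 = 0."""
    errors = []
    w = None
    for _ in range(steps):
        # Closed-form minimizer of (w-2)^2 + (rho/2)(w-1)^2 + alpha*(w-1).
        w = (4.0 + rho - alpha) / (2.0 + rho)
        # Dual ascent step: alpha <- alpha + rho * h(w).
        alpha = alpha + rho * (w - 1.0)
        # Distance to the optimal multiplier alpha* = 2 of this toy problem.
        errors.append(abs(alpha - 2.0))
    return w, alpha, errors
```

Calling `dual_ascent_toy()` returns the last primal iterate, the final multiplier and the per-step multiplier errors, whose successive ratios equal the contraction factor.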

Solving the unconstrained subproblem

Each subproblem $\min_{W} L^{ρ}(W, α)$ is, writing $w = \operatorname{vec}(W) \in \mathbb{R}^{p}$ with $p = d^{2}$:

Subproblem (NOTEARS §4.2, Eqs. 15-16)

$\min_{w \in \mathbb{R}^{p}} f(w) + λ \|w\|_{1}, \qquad f(w) = ℓ(W; X) + \frac{ρ}{2} |h(W)|^{2} + α h(W),$
where $f$ is the smooth part of the objective.

Two regimes:

$λ = 0$ (no sparsity): the problem is a smooth unconstrained minimization, solved by L-BFGS ([Byrd et al., 1995; Nocedal & Wright, 2006]). A slight modification (Nocedal & Wright, Procedure 18.2) handles the nonconvexity.
$λ > 0$: a composite (smooth + $ℓ_{1}$) problem solved by proximal quasi-Newton (PQN) ([Zhong et al., 2014]). At step $k$, find a descent direction via a quadratic model of the smooth part: $d_{k} = \arg\min_{d \in \mathbb{R}^{p}} g_{k}^{T} d + \frac{1}{2} d^{T} B_{k} d + λ \|w_{k} + d\|_{1},$ where $g_{k} = \nabla f(w_{k})$ and $B_{k}$ is the L-BFGS approximation of the Hessian. Each coordinate $j$ has a closed-form update $d \leftarrow d + z^{⋆} e_{j}$ via soft-thresholding $S(x, t) = \operatorname{sign}(x) \max(|x| - t, 0)$: with $a = B_{jj}$, $b = g_{j} + (B d)_{j}$ and $c = w_{j} + d_{j}$, $z^{⋆} = \arg\min_{z} \frac{1}{2} a z^{2} + b z + λ |c + z| = -c + S(c - \frac{b}{a}, \frac{λ}{a}).$
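The closed-form coordinate update in the $λ > 0$ regime can be sketched as follows. The scalars `a`, `b`, `c` stand for $B_{jj}$, $g_{j} + (Bd)_{j}$ and $w_{j} + d_{j}$; the function names are hypothetical, and the sketch assumes $a > 0$ so that the one-dimensional problem is strictly convex.

```python
import numpy as np


def soft_threshold(x, t):
    """S(x, t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def coordinate_step(a, b, c, lam):
    """Minimizer over z of 0.5*a*z^2 + b*z + lam*|c + z|, assuming a > 0."""
    return -c + soft_threshold(c - b / a, lam / a)
```

Substituting $u = c + z$ turns the objective into a quadratic in $u$ plus $λ|u|$, whose minimizer is the soft-thresholded value; subtracting $c$ recovers $z^{⋆}$.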

Efficiency from low-rank structure

The low-rank structure of the L-BFGS Hessian approximation $B_{k}$ enables fast coordinate updates: precomputation is $O(m^{2} p + m^{3})$, where $m ≪ p$ is the L-BFGS memory size, and each coordinate update is $O(m)$. Because the problem is sparsity-regularized, the active set $S$ of coordinates can be shrunk aggressively based on their subgradients, and the remaining coordinates are excluded from updates, so every $O(p)$ dependency becomes $O(|S|)$. Overall L-BFGS update cost: $O(m^{2} |S| + m^{3} + m |S| T)$, with $T \approx 10$ inner iterations.

Thresholding

Hard thresholding (NOTEARS §4.3)

After obtaining a stationary point $W_{ECP}$ of (9), given a threshold $ω > 0$, set any weight smaller than $ω$ in absolute value to zero:
$W := W_{ECP} \circ 1 (∣ W_{ECP} ∣ > ω) .$

Why thresholding works here

Numerically, the solution satisfies $h(W_{ECP}) \leq ϵ$ for a small tolerance (e.g. $ϵ = 10^{-8}$) rather than $= 0$ exactly. Because $h$ quantifies DAG-ness (desideratum (b)), a small threshold $ω$ suffices to remove the tiny residual cycle-inducing edges and “round” the solution to an exact DAG. Hard thresholding also provably reduces false discoveries in regression ([Zhou, 2009; Wang et al., 2016]).
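A small sketch of the rounding step, with a hypothetical helper `is_dag` that checks acyclicity of the support through a matrix power (a graph on $d$ nodes is acyclic exactly when the $d$-th power of its binary adjacency matrix is zero). The example matrix and the value of `omega` are made up for illustration.

```python
import numpy as np


def hard_threshold(W_ecp, omega):
    """Zero out every entry with |W_ij| <= omega."""
    return W_ecp * (np.abs(W_ecp) > omega)


def is_dag(W):
    """True if the support of W has no directed cycle (hypothetical helper)."""
    A = (W != 0).astype(int)
    # A walk of length d on d nodes must revisit a node, i.e. contain a cycle.
    return not np.linalg.matrix_power(A, A.shape[0]).any()


# Illustrative solution with two tiny cycle-inducing weights.
W_ecp = np.array([[0.0, 1.2, 0.0],
                  [1e-4, 0.0, -0.8],
                  [2e-5, 0.0, 0.0]])
W = hard_threshold(W_ecp, omega=0.3)
```

Here the raw matrix has a cyclic support because of the two tiny entries, while the thresholded matrix keeps only the two strong edges.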

The full algorithm

Algorithm 1: NOTEARS (NOTEARS §4)

Input: initial guess $(W_{0}, α_{0})$ , progress rate $c \in (0, 1)$ , tolerance $ϵ > 0$ , threshold $ω > 0$ .

For $t = 0, 1, 2, \dots$ :

(Primal) $W_{t + 1} \leftarrow \arg\min_{W} L^{ρ}(W, α_{t})$ with $ρ$ chosen so that $h(W_{t + 1}) < c \, h(W_{t})$ (i.e. force a $c$-factor reduction in the constraint violation).

(Dual ascent) $α_{t + 1} \leftarrow α_{t} + ρ h (W_{t + 1})$ .

(Stop) If $h (W_{t + 1}) < ϵ$ , set $W_{ECP} = W_{t + 1}$ and break.

Return the thresholded matrix $W := W_{ECP} \circ 1 (∣ W_{ECP} ∣ > ω)$ .
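The loop above can be sketched for the $λ = 0$ regime with SciPy's L-BFGS-B solver. This is an illustration under several assumptions rather than the authors' implementation: a least-squares loss for $F$, $h(W) = \operatorname{tr}(e^{W \circ W}) - d$, multiplying $ρ$ by 10 whenever the $c$-factor reduction fails, a cap `rho_max`, and bound constraints that pin the diagonal of $W$ to zero. The function name and all default values are hypothetical.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize


def notears_sketch(X, c=0.25, eps=1e-8, omega=0.3, max_iter=100, rho_max=1e16):
    n, d = X.shape

    def h_and_grad(W):
        E = expm(W * W)
        return np.trace(E) - d, E.T * 2 * W

    def objective(w, rho, alpha):
        """Value and gradient of L^rho with an assumed least-squares F."""
        W = w.reshape(d, d)
        R = X - X @ W
        loss = 0.5 / n * np.sum(R ** 2)
        grad_loss = -X.T @ R / n
        hw, grad_h = h_and_grad(W)
        value = loss + 0.5 * rho * hw ** 2 + alpha * hw
        grad = grad_loss + (rho * hw + alpha) * grad_h
        return value, grad.ravel()

    # Assumption of this sketch: forbid self-loops by fixing the diagonal at 0.
    bounds = [(0, 0) if i == j else (None, None)
              for i in range(d) for j in range(d)]

    w, rho, alpha, h_old = np.zeros(d * d), 1.0, 0.0, np.inf
    for _ in range(max_iter):
        # Primal step: raise rho until h shrinks by the factor c.
        while rho < rho_max:
            res = minimize(objective, w, args=(rho, alpha), jac=True,
                           method="L-BFGS-B", bounds=bounds)
            h_new, _ = h_and_grad(res.x.reshape(d, d))
            if h_new < c * h_old:
                break
            rho *= 10  # assumed growth factor
        w, h_old = res.x, h_new
        alpha += rho * h_new  # dual ascent step
        if h_new < eps or rho >= rho_max:
            break

    W_ecp = w.reshape(d, d)
    return W_ecp * (np.abs(W_ecp) > omega)  # hard thresholding
```

A call such as `notears_sketch(X)` on an `n × d` data matrix returns a thresholded weight matrix. Handling $λ > 0$ would require replacing the inner solver with a proximal method, which this sketch does not attempt.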

Connections

Depends on the gradient $\nabla h (W) = (e^{W \circ W})^{T} \circ 2 W$ from Smooth Characterization of Acyclicity — fed into L-BFGS/PQN.
Global vs. local search: each primal step updates the entire matrix $W$ , avoiding the edge-at-a-time assumptions of local search (see DAG Structure Learning Problem).
Nonconvexity: the method finds stationary points; NOTEARS Experiments empirically checks how close these are to the global minimizer.
Matrix-exponential cost $O(d^{3})$ per evaluation motivates the use of quasi-Newton methods (L-BFGS/PQN), which typically need few iterations and therefore few evaluations of $h$.
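The gradient formula in the first connection can be checked numerically. The sketch below again assumes $h(W) = \operatorname{tr}(e^{W \circ W}) - d$ and compares the analytic gradient with central finite differences; the function names and the step size are choices made for this example.

```python
import numpy as np
from scipy.linalg import expm


def h(W):
    """Assumed acyclicity function tr(exp(W ∘ W)) - d."""
    return np.trace(expm(W * W)) - W.shape[0]


def grad_h(W):
    """Analytic gradient (exp(W ∘ W))^T ∘ 2W."""
    return expm(W * W).T * 2 * W


def finite_difference_grad(W, step=1e-6):
    """Central-difference approximation of the gradient of h, entry by entry."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = step
            G[i, j] = (h(W + E) - h(W - E)) / (2 * step)
    return G
```

Each call to `h` or `grad_h` costs one matrix exponential, which is the $O(d^{3})$ term mentioned above.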
