NOTEARS Algorithm

Summary

Given the smooth acyclicity function (see Smooth Characterization of Acyclicity), learning a DAG becomes the equality-constrained program (ECP) , solved by the augmented Lagrangian method. The algorithm has three pieces: (i) convert the constrained problem into a sequence of unconstrained subproblems via a quadratic penalty + dual ascent; (ii) solve each subproblem with L-BFGS / proximal quasi-Newton; (iii) threshold the solution to round small weights to zero. Typically fewer than 10 augmented-Lagrangian iterations are needed.

Overview

Theorem 1 turns the combinatorial DAG constraint into a smooth equality constraint. NOTEARS then treats DAG learning as a classical equality-constrained optimization problem and solves it to stationarity (not global optimality — the program is nonconvex). The whole method is deliberately built from off-the-shelf solver components.

Main Content

The equality-constrained program (ECP)

Program (9): ECP (NOTEARS §4)

Equivalent to program (3). Its advantage: amenable to classical constrained-optimization techniques. But is a nonconvex set, so (9) inherits the difficulties of nonconvex optimization — NOTEARS settles for stationary points of (9).

Augmented Lagrangian

Augmented Lagrangian (NOTEARS §4.1, Eqs. 10-12)

The augmented Lagrangian augments the objective with a quadratic penalty () plus the Lagrange-multiplier term:

The dual function and dual problem are

Why augmented Lagrangian (vs. plain quadratic penalty)

A key property ([Nemirovski, 1999]): the augmented Lagrangian approximates the constrained solution well without driving the penalty (which would make subproblems ill-conditioned). It is essentially a dual ascent scheme for the penalized problem.

Dual ascent update

Let be the local minimizer at , so . Since is linear in , its derivative is simply . Hence dual gradient ascent:

Dual update (NOTEARS §4.1, Eq. 14)

Proposition 3: Convergence rate (NOTEARS §4.1; Cor. 11.2.1, Nemirovski 1999)

For large enough and starting point near the solution , the update (14) converges to linearly. In experiments, typically fewer than 10 augmented-Lagrangian steps are required.

Solving the unconstrained subproblem

Each subproblem is, writing with :

Subproblem (NOTEARS §4.2, Eqs. 15-16)

where is the smooth part of the objective.

Two regimes:

  • (no sparsity): the problem is a smooth unconstrained minimization, solved by L-BFGS ([Byrd et al., 1995; Nocedal & Wright, 2006]). A slight modification (Nocedal & Wright, Procedure 18.2) handles the nonconvexity.
  • : a composite (smooth + ) problem solved by proximal quasi-Newton (PQN) ([Zhong et al., 2014]). At step , find a descent direction via a quadratic model of the smooth part: where and is the L-BFGS approximation of the Hessian. Each coordinate has a closed-form update via soft-thresholding :

Efficiency from low-rank structure

The low-rank structure of the L-BFGS Hessian enables fast coordinate updates: precomputation is where is the L-BFGS memory size, and each coordinate update is . Aggressively shrinking the active set of coordinates by subgradient makes all dependencies become . Overall L-BFGS update cost: , with inner iterations .

Thresholding

Hard thresholding (NOTEARS §4.3)

After obtaining a stationary point of (10), given a threshold , set any weight smaller than in absolute value to zero:

Why thresholding works here

Numerically, the solution satisfies for a small tolerance (e.g. ) rather than exactly. Because quantifies DAG-ness (desideratum (b)), a small threshold suffices to remove the tiny residual cycle-inducing edges and “round” the solution to an exact DAG. Hard thresholding also provably reduces false discoveries in regression ([Zhou, 2009; Wang et al., 2016]).

The full algorithm

Algorithm 1: NOTEARS (NOTEARS §4)

Input: initial guess , progress rate , tolerance , threshold .

For :

  1. (Primal) with chosen so that (i.e. force a -factor reduction in the constraint violation).
  2. (Dual ascent) .
  3. (Stop) If , set and break.

Return the thresholded matrix .

Connections

  • Depends on the gradient from Smooth Characterization of Acyclicity — fed into L-BFGS/PQN.
  • Global vs. local search: each primal step updates the entire matrix , avoiding the edge-at-a-time assumptions of local search (see DAG Structure Learning Problem).
  • Nonconvexity: the method finds stationary points; NOTEARS Experiments empirically checks how close these are to the global minimizer.
  • Matrix-exponential cost per evaluation motivates the use of second-order methods (L-BFGS/PQN) to reduce the number of evaluations.

See Also