NOTEARS Algorithm
Summary
Given the smooth acyclicity function (see Smooth Characterization of Acyclicity), learning a DAG becomes the equality-constrained program (ECP) , solved by the augmented Lagrangian method. The algorithm has three pieces: (i) convert the constrained problem into a sequence of unconstrained subproblems via a quadratic penalty + dual ascent; (ii) solve each subproblem with L-BFGS / proximal quasi-Newton; (iii) threshold the solution to round small weights to zero. Typically fewer than 10 augmented-Lagrangian iterations are needed.
Overview
Theorem 1 turns the combinatorial DAG constraint into a smooth equality constraint. NOTEARS then treats DAG learning as a classical equality-constrained optimization problem and solves it to stationarity (not global optimality — the program is nonconvex). The whole method is deliberately built from off-the-shelf solver components.
Main Content
The equality-constrained program (ECP)
Program (9): ECP (NOTEARS §4)
Equivalent to program (3). Its advantage: amenable to classical constrained-optimization techniques. But is a nonconvex set, so (9) inherits the difficulties of nonconvex optimization — NOTEARS settles for stationary points of (9).
Augmented Lagrangian
Augmented Lagrangian (NOTEARS §4.1, Eqs. 10-12)
The augmented Lagrangian augments the objective with a quadratic penalty () plus the Lagrange-multiplier term:
The dual function and dual problem are
Why augmented Lagrangian (vs. plain quadratic penalty)
A key property ([Nemirovski, 1999]): the augmented Lagrangian approximates the constrained solution well without driving the penalty (which would make subproblems ill-conditioned). It is essentially a dual ascent scheme for the penalized problem.
Dual ascent update
Let be the local minimizer at , so . Since is linear in , its derivative is simply . Hence dual gradient ascent:
Dual update (NOTEARS §4.1, Eq. 14)
Proposition 3: Convergence rate (NOTEARS §4.1; Cor. 11.2.1, Nemirovski 1999)
For large enough and starting point near the solution , the update (14) converges to linearly. In experiments, typically fewer than 10 augmented-Lagrangian steps are required.
Solving the unconstrained subproblem
Each subproblem is, writing with :
Subproblem (NOTEARS §4.2, Eqs. 15-16)
where is the smooth part of the objective.
Two regimes:
- (no sparsity): the problem is a smooth unconstrained minimization, solved by L-BFGS ([Byrd et al., 1995; Nocedal & Wright, 2006]). A slight modification (Nocedal & Wright, Procedure 18.2) handles the nonconvexity.
- : a composite (smooth + ) problem solved by proximal quasi-Newton (PQN) ([Zhong et al., 2014]). At step , find a descent direction via a quadratic model of the smooth part: where and is the L-BFGS approximation of the Hessian. Each coordinate has a closed-form update via soft-thresholding :
Efficiency from low-rank structure
The low-rank structure of the L-BFGS Hessian enables fast coordinate updates: precomputation is where is the L-BFGS memory size, and each coordinate update is . Aggressively shrinking the active set of coordinates by subgradient makes all dependencies become . Overall L-BFGS update cost: , with inner iterations .
Thresholding
Hard thresholding (NOTEARS §4.3)
After obtaining a stationary point of (10), given a threshold , set any weight smaller than in absolute value to zero:
Why thresholding works here
Numerically, the solution satisfies for a small tolerance (e.g. ) rather than exactly. Because quantifies DAG-ness (desideratum (b)), a small threshold suffices to remove the tiny residual cycle-inducing edges and “round” the solution to an exact DAG. Hard thresholding also provably reduces false discoveries in regression ([Zhou, 2009; Wang et al., 2016]).
The full algorithm
Algorithm 1: NOTEARS (NOTEARS §4)
Input: initial guess , progress rate , tolerance , threshold .
For :
- (Primal) with chosen so that (i.e. force a -factor reduction in the constraint violation).
- (Dual ascent) .
- (Stop) If , set and break.
Return the thresholded matrix .
Connections
- Depends on the gradient from Smooth Characterization of Acyclicity — fed into L-BFGS/PQN.
- Global vs. local search: each primal step updates the entire matrix , avoiding the edge-at-a-time assumptions of local search (see DAG Structure Learning Problem).
- Nonconvexity: the method finds stationary points; NOTEARS Experiments empirically checks how close these are to the global minimizer.
- Matrix-exponential cost per evaluation motivates the use of second-order methods (L-BFGS/PQN) to reduce the number of evaluations.
See Also
- Smooth Characterization of Acyclicity — supplies and
- NOTEARS Experiments — empirical validation of the algorithm
- NOTEARS - Overview — paper-level context