Skip to content
Matthew Reda
Go back

NO TEARS: How Acyclicity Became Differentiable

Edit page

For most of its history, learning a causal graph from data was a search problem, and a brutal one. You want the directed acyclic graph that best explains your data, but the number of DAGs on dd nodes grows superexponentially — faster than d!d!, faster than 2d22^{d^2}. So the field built clever heuristics: add an edge, check it doesn’t create a cycle, score the result, greedily keep going. PC, GES, FGS — all variations on walking the discrete space one edge at a time, because you can’t enumerate it and you can’t differentiate it.

In 2018, Zheng, Aragam, Ravikumar, and Xing published a paper with a title that tells you exactly how the field felt about this: DAGs with NO TEARS. Their move was to stop searching the discrete space at all. They found a way to write “this graph is acyclic” as a smooth equation, and once you can do that, structure learning becomes continuous optimization. You can run a gradient-based solver. The whole thing is about fifty lines of Python.

I find this one of the cleaner examples of a recurring pattern in applied math: the hard part of a problem is often the representation, not the objective. Change how you encode the constraint and an intractable problem becomes a routine one.

The problem, set up honestly

Assume a linear structural equation model. Each variable is a noisy linear function of its parents:

Xj=wjX+zj,j=1,,dX_j = w_j^\top X + z_j, \quad j = 1, \ldots, d

The whole graph lives in a weighted adjacency matrix WRd×dW \in \mathbb{R}^{d \times d}, where wij0w_{ij} \neq 0 means there’s an edge iji \to j. Fitting the model is least squares with a sparsity penalty:

F(W)=12nXXWF2+λW1F(W) = \frac{1}{2n}\left\| \mathbf{X} - \mathbf{X}W \right\|_F^2 + \lambda \|W\|_1

That part is easy. The hard part is the constraint that WW has to encode a DAG:

minWF(W)s.t.W is acyclic\min_{W} F(W) \quad \text{s.t.} \quad W \text{ is acyclic}

“Acyclic” is the word doing all the damage. It’s a combinatorial property of the sign pattern of WW, and optimizing subject to it is NP-hard. The classical approaches respect that and search; NOTEARS refuses to.

The trick: counting cycles with a matrix exponential

Here is the idea, built up the way it actually works.

A directed graph with binary adjacency matrix BB has a cycle of length kk exactly when tr(Bk)>0\mathrm{tr}(B^k) > 0 — the diagonal of BkB^k counts closed walks of length kk. So the graph is acyclic iff every power has zero trace. Summing those traces gives a single number that’s zero iff there are no cycles of any length. But raw powers of BB are numerically unstable and the sum doesn’t obviously converge.

The fix is to borrow the weighting from the matrix exponential, eB=k=0Bk/k!e^B = \sum_{k=0}^\infty B^k / k!. The factorial denominators tame the series — it converges for any square matrix — while preserving the “diagonal counts cycles” property. For a binary matrix:

tr(eB)=d    B is a DAG\mathrm{tr}(e^B) = d \iff B \text{ is a DAG}

(The dd on the right is just the k=0k=0 term, tr(I)=d\mathrm{tr}(I) = d; every higher term must vanish.)

To extend this from binary to real, signed weights, they apply the Hadamard (elementwise) square WWW \circ W. The result is the centerpiece of the paper:

h(W)=tr ⁣(eWW)d=0    W is a DAGh(W) = \mathrm{tr}\!\left(e^{W \circ W}\right) - d = 0 \iff W \text{ is a DAG}

Why square the entries? Because without it, a two-cycle 1211 \to 2 \to 1 with weights +1+1 and 1-1 contributes w12w21=1w_{12}\,w_{21} = -1 to the trace, and a cycle could cancel itself out numerically — you’d certify a cyclic graph as acyclic. Squaring forces every cyclic contribution strictly positive, so h(W)>0h(W) > 0 precisely when a cycle exists, and h(W)=0h(W) = 0 precisely when it doesn’t. As a bonus, hh doesn’t just test acyclicity — its value quantifies how cyclic the graph is, which matters for the optimizer.

And it’s smooth, with a gradient you can write in one line:

h(W)=(eWW)2W\nabla h(W) = \left(e^{W \circ W}\right)^\top \circ 2W

That’s the whole conceptual payoff. The discrete constraint “is a DAG” has become a smooth equality constraint h(W)=0h(W) = 0 with a cheap gradient.

From here it’s just constrained optimization

Once the constraint is smooth, the rest is standard machinery. NOTEARS solves

minWF(W)s.t.h(W)=0\min_W F(W) \quad \text{s.t.} \quad h(W) = 0

with the augmented Lagrangian:

Lρ(W,α)=F(W)+ρ2h(W)2+αh(W)L^\rho(W, \alpha) = F(W) + \frac{\rho}{2}\,|h(W)|^2 + \alpha\, h(W)

You minimize over WW (L-BFGS, or a proximal quasi-Newton step when the 1\ell_1 term is on), bump the multiplier αα+ρh(W)\alpha \leftarrow \alpha + \rho\, h(W), raise the penalty ρ\rho, and repeat. In practice it converges in fewer than ten outer iterations. A final hard-threshold step zeroes out tiny weights to round the near-DAG to an exact one. Each iteration costs O(d3)O(d^3) — the price of one matrix exponential.

The honest caveat: the constraint set {W:h(W)=0}\{W : h(W) = 0\} is nonconvex, so you’re only guaranteed a stationary point, not the global optimum. But the empirical results are reassuring. On 10-node graphs where an exact integer-programming solver can find the true global minimum, NOTEARS lands within 0.020.020.040.04 of it in both score and parameter distance. Against FGS it’s competitive on sparse graphs and decisively better on dense, hub-heavy ones — exactly where edge-at-a-time greedy search struggles. And it’s robust across Gaussian, Exponential, and Gumbel noise without being told which it’s facing.

The lesson that outlasts the method

NOTEARS isn’t the last word — it assumes linearity, the O(d3)O(d^3) exponential is a bottleneck at scale, and a wave of follow-up work (DAG-GNN, GOLEM, NOCURL, and the nonlinear extensions) has pushed on every one of those limits. But the contribution that mattered wasn’t a better search heuristic. It was a single equation:

h(W)=tr(eWW)dh(W) = \mathrm{tr}(e^{W \circ W}) - d

that turned a property everyone treated as inherently combinatorial into something you can take a gradient of. When a problem feels intractable, it’s worth asking whether the intractability lives in the problem or in the way you’ve written it down. Sometimes the entire difficulty is a bad change of variables away from disappearing.


Based on Zheng, Aragam, Ravikumar & Xing (2018), “DAGs with NO TEARS: Continuous Optimization for Structure Learning,” NeurIPS 31, arXiv:1803.01422.


Edit page
Share this post:

Previous Post
Designing Experiments to Maximize Information
Next Post
The Illusion of Significance: Why p-value Variable Selection Breaks Marketing Mix Models