For most of its history, learning a causal graph from data was a search problem, and a brutal one. You want the directed acyclic graph that best explains your data, but the number of DAGs on nodes grows superexponentially — faster than , faster than . So the field built clever heuristics: add an edge, check it doesn’t create a cycle, score the result, greedily keep going. PC, GES, FGS — all variations on walking the discrete space one edge at a time, because you can’t enumerate it and you can’t differentiate it.
In 2018, Zheng, Aragam, Ravikumar, and Xing published a paper with a title that tells you exactly how the field felt about this: DAGs with NO TEARS. Their move was to stop searching the discrete space at all. They found a way to write “this graph is acyclic” as a smooth equation, and once you can do that, structure learning becomes continuous optimization. You can run a gradient-based solver. The whole thing is about fifty lines of Python.
I find this one of the cleaner examples of a recurring pattern in applied math: the hard part of a problem is often the representation, not the objective. Change how you encode the constraint and an intractable problem becomes a routine one.
The problem, set up honestly
Assume a linear structural equation model. Each variable is a noisy linear function of its parents:
The whole graph lives in a weighted adjacency matrix , where means there’s an edge . Fitting the model is least squares with a sparsity penalty:
That part is easy. The hard part is the constraint that has to encode a DAG:
“Acyclic” is the word doing all the damage. It’s a combinatorial property of the sign pattern of , and optimizing subject to it is NP-hard. The classical approaches respect that and search; NOTEARS refuses to.
The trick: counting cycles with a matrix exponential
Here is the idea, built up the way it actually works.
A directed graph with binary adjacency matrix has a cycle of length exactly when — the diagonal of counts closed walks of length . So the graph is acyclic iff every power has zero trace. Summing those traces gives a single number that’s zero iff there are no cycles of any length. But raw powers of are numerically unstable and the sum doesn’t obviously converge.
The fix is to borrow the weighting from the matrix exponential, . The factorial denominators tame the series — it converges for any square matrix — while preserving the “diagonal counts cycles” property. For a binary matrix:
(The on the right is just the term, ; every higher term must vanish.)
To extend this from binary to real, signed weights, they apply the Hadamard (elementwise) square . The result is the centerpiece of the paper:
Why square the entries? Because without it, a two-cycle with weights and contributes to the trace, and a cycle could cancel itself out numerically — you’d certify a cyclic graph as acyclic. Squaring forces every cyclic contribution strictly positive, so precisely when a cycle exists, and precisely when it doesn’t. As a bonus, doesn’t just test acyclicity — its value quantifies how cyclic the graph is, which matters for the optimizer.
And it’s smooth, with a gradient you can write in one line:
That’s the whole conceptual payoff. The discrete constraint “is a DAG” has become a smooth equality constraint with a cheap gradient.
From here it’s just constrained optimization
Once the constraint is smooth, the rest is standard machinery. NOTEARS solves
with the augmented Lagrangian:
You minimize over (L-BFGS, or a proximal quasi-Newton step when the term is on), bump the multiplier , raise the penalty , and repeat. In practice it converges in fewer than ten outer iterations. A final hard-threshold step zeroes out tiny weights to round the near-DAG to an exact one. Each iteration costs — the price of one matrix exponential.
The honest caveat: the constraint set is nonconvex, so you’re only guaranteed a stationary point, not the global optimum. But the empirical results are reassuring. On 10-node graphs where an exact integer-programming solver can find the true global minimum, NOTEARS lands within – of it in both score and parameter distance. Against FGS it’s competitive on sparse graphs and decisively better on dense, hub-heavy ones — exactly where edge-at-a-time greedy search struggles. And it’s robust across Gaussian, Exponential, and Gumbel noise without being told which it’s facing.
The lesson that outlasts the method
NOTEARS isn’t the last word — it assumes linearity, the exponential is a bottleneck at scale, and a wave of follow-up work (DAG-GNN, GOLEM, NOCURL, and the nonlinear extensions) has pushed on every one of those limits. But the contribution that mattered wasn’t a better search heuristic. It was a single equation:
that turned a property everyone treated as inherently combinatorial into something you can take a gradient of. When a problem feels intractable, it’s worth asking whether the intractability lives in the problem or in the way you’ve written it down. Sometimes the entire difficulty is a bad change of variables away from disappearing.
Based on Zheng, Aragam, Ravikumar & Xing (2018), “DAGs with NO TEARS: Continuous Optimization for Structure Learning,” NeurIPS 31, arXiv:1803.01422.