DAG Structure Learning Problem
Summary
Sets up the problem NOTEARS solves: given i.i.d. observations of a -dimensional random vector, learn a DAG (Bayesian network) over the variables. NOTEARS models the data with a linear structural equation model (SEM) whose weighted adjacency matrix is the graph, and frames learning as minimizing a regularized least-squares score subject to acyclicity. The crux: although the score is continuous, the discrete DAG constraint is the hard part.
Overview
The vault distinguishes two equivalent views of score-based DAG learning. The traditional combinatorial view optimizes a discrete score over the set of DAGs . NOTEARS’s view optimizes a continuous least-squares score over the real matrices , identifying each matrix with the SEM it parameterizes. This note formalizes both and the SEM machinery linking them.
Main Content
Setup and notation
Definition: Data, DAGs, and SEM (NOTEARS §2)
Let be a data matrix of i.i.d. observations of the random vector . Let denote the discrete space of DAGs on nodes. We model via a structural equation model (SEM) defined by a weighted adjacency matrix . Learning a Bayesian network = learning for the joint distribution .
From matrices to graphs
Any defines a graph as follows.
Definition: Induced graph (NOTEARS §2.1)
Let be the binary matrix with
and otherwise. Then is the adjacency matrix of a directed graph . By a slight abuse of notation, itself is treated as a (weighted) graph.
The linear SEM
Writing in columns, defines a linear SEM:
Definition: Linear SEM (NOTEARS §2.1)
where is the random vector and is a random noise vector. Crucially, is not assumed Gaussian. More generally one can use a GLM — e.g. logistic regression for binary .
The least-squares score
NOTEARS focuses on the least-squares (LS) loss, though everything applies to any smooth loss .
Definition: Regularized LS score (NOTEARS §2.1, Eq. 2)
With LS loss and -regularization to encourage sparsity:
Statistical justification
The minimizer of the LS loss provably recovers a true DAG with high probability in finite samples and high dimensions (), consistent for both Gaussian SEM ([van de Geer & Bühlmann, 2013; Aragam et al., 2016]) and non-Gaussian SEM ([Loh & Bühlmann, 2014]). These results do not require the faithfulness assumption. Given this prior work on statistical issues, NOTEARS focuses entirely on the computational problem of finding the SEM that minimizes the LS loss.
The continuous program (NOTEARS’s target)
Program (3): Continuous score-based DAG learning (NOTEARS §2.1)
is continuous, but the DAG constraint remains combinatorial and is the central obstacle — resolved in Smooth Characterization of Acyclicity.
Contrast: the traditional combinatorial program
Program (4): Traditional combinatorial score-based learning (NOTEARS §2.2)
where is a discrete score (BDe(u), BGe, BIC, MDL). Program (4) is NP-hard ([Chickering, 1996; Chickering et al., 2004]) owing to the nonconvex, combinatorial acyclicity constraint, whose number of acyclic structures grows superexponentially in ([Robinson, 1977]).
The essential distinction: program (4)‘s domain is the discrete set , whereas program (3)‘s domain is the continuous .
Landscape of prior approaches
| Camp | Idea | Limitation |
|---|---|---|
| Exact ([Cussens, 2012]; GOBNILP; [Chen et al., 2016]) | Guaranteed globally optimal | Only a few dozen nodes; intractable in general |
| Local / approximate search (FGS, GES, hill-climbing, MMHC) | Add edges/parents one node at a time, check acyclicity incrementally | Needs bounded in-degree/treewidth — impossible to verify; real networks are scale-free with hub nodes |
| Order search ([Teyssier & Koller, 2005]) | Search over topological orderings | Trades acyclicity for an exponential ordering search |
| Constraint-based (PC, [Spirtes & Glymour, 1991]) | Conditional-independence tests | Different paradigm; often less accurate |
| Hybrid / Bayesian (MMHC; [Zhou, 2011]) | Combine the above | Conceptual complexity |
The "conceptual clarity" gap NOTEARS targets
A recurring drawback the authors emphasize: prior methods are conceptually complex — they require deep graphical-model knowledge and clever tricks to accelerate. NOTEARS needs none: “implementable in just a few lines of code using existing black-box solvers.”
Connections
- Undirected analogy: undirected (Markov network) structure learning is a convex log-det program ([Banerjee et al., 2008]); the directed case resisted such treatment until NOTEARS.
- Local vs. global: NOTEARS updates the entire at each step (global), unlike edge-at-a-time local search.
- SEM grounding: see Confirmatory Factor Analysis and SEM for structural equation models in the Bayesian setting; the weighted adjacency matrix here is the SEM coefficient matrix.
See Also
- NOTEARS - Overview — paper-level summary
- Smooth Characterization of Acyclicity — how the constraint becomes
- NOTEARS Algorithm — solving program (3)
- Spurious Association and Confounds — DAG semantics in causal inference