DAG Structure Learning Problem

Summary

Sets up the problem NOTEARS solves: given $n$ i.i.d. observations of a $d$-dimensional random vector, learn a DAG (Bayesian network) over the variables. NOTEARS models the data with a linear structural equation model (SEM) whose weighted adjacency matrix $W \in R^{d \times d}$ encodes the graph, and frames learning as minimizing a regularized least-squares score subject to acyclicity. The crux: although the score $F (W)$ is continuous, the constraint $G (W) \in D$ is discrete, and that is the hard part.

Overview

There are two equivalent views of score-based DAG learning. The traditional combinatorial view optimizes a discrete score over the set of DAGs $D$. NOTEARS’s view optimizes a continuous least-squares score over the real matrices $R^{d \times d}$, identifying each matrix with the SEM it parameterizes. This note formalizes both views and the SEM machinery linking them.

Main Content

Setup and notation

Definition: Data, DAGs, and SEM (NOTEARS §2)

Let $X \in R^{n \times d}$ be a data matrix of $n$ i.i.d. observations of the random vector $X = (X_{1}, \dots, X_{d})$. Let $D$ denote the discrete space of DAGs $G = (V, E)$ on $d$ nodes. We model $X$ via a structural equation model (SEM) defined by a weighted adjacency matrix $W \in R^{d \times d}$. Learning a Bayesian network means learning a DAG $G \in D$ for the joint distribution $P (X)$.

From matrices to graphs

Any $W \in R^{d \times d}$ defines a graph as follows.

Definition: Induced graph $G (W)$ (NOTEARS §2.1)

Let $A (W) \in \{0, 1\}^{d \times d}$ be the binary matrix with
$[A (W)]_{ij} = 1 \iff w_{ij} \neq 0,$
and $0$ otherwise. Then $A (W)$ is the adjacency matrix of a directed graph $G (W)$. By a slight abuse of notation, $W$ itself is treated as a (weighted) graph.
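The definition above can be sketched in a few lines of NumPy. This is only an illustration: the function name `induced_adjacency`, the optional `tol` threshold, and the 3-node example matrix are assumptions made here, not part of NOTEARS. With `tol=0.0` the function reproduces the definition exactly, and entry $(i, j)$ is read as the edge $i \to j$, which matches the column convention of the linear SEM.

```python
import numpy as np

def induced_adjacency(W, tol=0.0):
    """Binary adjacency A(W): entry (i, j) is 1 iff |w_ij| > tol.

    tol is a hypothetical convenience; tol=0.0 matches w_ij != 0 exactly.
    """
    W = np.asarray(W, dtype=float)
    return (np.abs(W) > tol).astype(int)

# Hypothetical 3-node example: edges 0 -> 1 (weight 0.8) and 1 -> 2 (weight -1.5).
W_example = np.array([
    [0.0, 0.8,  0.0],
    [0.0, 0.0, -1.5],
    [0.0, 0.0,  0.0],
])

A_example = induced_adjacency(W_example)
# List the directed edges (i, j) of G(W) in row-major order.
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(A_example))]
```

Here `edges` lists the edges of $G (W)$, and the edge weights are simply dropped.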

The linear SEM

Writing $W = [w_{1} ∣ \dots ∣ w_{d}]$ in columns, $W$ defines a linear SEM:

Definition: Linear SEM (NOTEARS §2.1)

$X_{j} = w_{j}^{T} X + z_{j}, \quad j = 1, \dots, d,$
where $X = (X_{1}, \dots, X_{d})$ is the random vector and $z = (z_{1}, \dots, z_{d})$ is a random noise vector. Crucially, $z$ is not assumed Gaussian. More generally one can use a GLM $E (X_{j} ∣ X_{pa (X_{j})}) = f (w_{j}^{T} X)$ — e.g. logistic regression for binary $X_{j}$ .
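To make the linear SEM concrete, the sketch below draws samples from it. Stacking the $d$ equations row-wise for $n$ samples gives $X = X W + Z$, so $X = Z (I - W)^{-1}$ whenever $I - W$ is invertible, which holds when $W$ is the weighted adjacency matrix of a DAG. The function name `simulate_linear_sem`, the two noise options, and the example matrix `W_true` are assumptions for illustration. Uniform noise is used by default to reflect that $z$ need not be Gaussian.

```python
import numpy as np

def simulate_linear_sem(W, n, rng, noise="uniform"):
    """Draw n samples from X_j = w_j^T X + z_j, assuming I - W is invertible."""
    W = np.asarray(W, dtype=float)
    d = W.shape[0]
    if noise == "uniform":
        Z = rng.uniform(-1.0, 1.0, size=(n, d))   # non-Gaussian noise
    elif noise == "gaussian":
        Z = rng.standard_normal(size=(n, d))
    else:
        raise ValueError(f"unknown noise type: {noise}")
    # Row form of the SEM: X = X W + Z  <=>  X (I - W) = Z.
    # Solve the transposed system (I - W)^T X^T = Z^T instead of inverting.
    X = np.linalg.solve((np.eye(d) - W).T, Z.T).T
    return X, Z

# Hypothetical chain 0 -> 1 -> 2.
W_true = np.array([
    [0.0, 0.8,  0.0],
    [0.0, 0.0, -1.5],
    [0.0, 0.0,  0.0],
])
rng = np.random.default_rng(0)
X_sim, Z_sim = simulate_linear_sem(W_true, n=200, rng=rng)
```

By construction, each column of `X_sim` satisfies its structural equation; for example, column 1 equals `0.8 * X_sim[:, 0] + Z_sim[:, 1]`.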

The least-squares score

NOTEARS focuses on the least-squares (LS) loss, though everything applies to any smooth loss $ℓ$ .

Definition: Regularized LS score $F (W)$ (NOTEARS §2.1, Eq. 2)

With LS loss $ℓ (W; X) = \frac{1}{2 n} ∥ X - X W ∥_{F}^{2}$ and $ℓ_{1}$-regularization $∥ W ∥_{1} = ∥ vec (W) ∥_{1}$ to encourage sparsity:
$F (W) = ℓ (W; X) + λ ∥ W ∥_{1} = \frac{1}{2 n} ∥ X - X W ∥_{F}^{2} + λ ∥ W ∥_{1} .$
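A minimal NumPy sketch of the score follows, under the assumption that `X` is the $n \times d$ data matrix and `lam` stands for $λ$. The function names are hypothetical. The gradient of the smooth part follows from differentiating the Frobenius norm, $\nabla ℓ (W; X) = -\frac{1}{n} X^{T} (X - X W)$; the $ℓ_{1}$ term is not differentiable at zero and is therefore left out of the gradient helper.

```python
import numpy as np

def ls_loss(W, X):
    """Least-squares loss (1 / 2n) * ||X - X W||_F^2."""
    n = X.shape[0]
    R = X - X @ W                      # residuals of all d regressions at once
    return 0.5 / n * np.sum(R ** 2)

def ls_loss_grad(W, X):
    """Gradient of the LS loss with respect to W: -(1/n) X^T (X - X W)."""
    n = X.shape[0]
    return -(X.T @ (X - X @ W)) / n

def score(W, X, lam):
    """Regularized score F(W) = LS loss + lam * ||vec(W)||_1."""
    return ls_loss(W, X) + lam * np.abs(W).sum()

# Tiny hypothetical data set with n = 2 samples and d = 2 variables.
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
W_toy = np.array([[0.0, 1.0],
                  [0.0, 0.0]])
F_toy = score(W_toy, X_toy, lam=0.1)
```

For this toy input the residual matrix is `[[1, 1], [3, 1]]`, so the loss is $12 / 4 = 3.0$ and the score is $3.0 + 0.1 \cdot 1 = 3.1$.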

Statistical justification

The minimizer of the LS loss provably recovers a true DAG with high probability in finite samples and in high dimensions ($d ≫ n$), and is therefore consistent for both Gaussian SEM ([van de Geer & Bühlmann, 2013; Aragam et al., 2016]) and non-Gaussian SEM ([Loh & Bühlmann, 2014]). These results do not require the faithfulness assumption. Given this prior work on statistical issues, NOTEARS focuses entirely on the computational problem of finding the SEM that minimizes the LS loss.

The continuous program (NOTEARS’s target)

Program (3): Continuous score-based DAG learning (NOTEARS §2.1)

$\min_{W \in R^{d \times d}} F (W) \quad \text{subject to} \quad G (W) \in D .$
$F (W)$ is continuous, but the DAG constraint $G (W) \in D$ remains combinatorial and is the central obstacle — resolved in Smooth Characterization of Acyclicity.
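To see why the constraint is combinatorial, note that testing $G (W) \in D$ is a discrete yes/no question about the support of $W$. The sketch below answers it with Kahn's topological-sort algorithm. This is a standard graph routine chosen here for illustration, not the smooth characterization NOTEARS develops, and the function name `is_dag` is an assumption.

```python
from collections import deque

def is_dag(W):
    """Return True iff the graph induced by the nonzero entries of W is acyclic.

    W is a square matrix (nested lists or a 2-D array); W[i][j] != 0 means i -> j.
    """
    d = len(W)
    children = [[j for j in range(d) if W[i][j] != 0] for i in range(d)]
    indegree = [0] * d
    for i in range(d):
        for j in children[i]:
            indegree[j] += 1

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(i for i in range(d) if indegree[i] == 0)
    removed = 0
    while queue:
        i = queue.popleft()
        removed += 1
        for j in children[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)

    # Every node can be removed exactly when there is no directed cycle.
    return removed == d

chain = [[0, 0.8, 0], [0, 0, -1.5], [0, 0, 0]]   # 0 -> 1 -> 2
two_cycle = [[0, 1.0], [0.5, 0]]                 # 0 -> 1 -> 0
print(is_dag(chain), is_dag(two_cycle))
```

The answer depends only on which entries of $W$ are nonzero, so it can flip under an arbitrarily small perturbation of $W$, which is why this test cannot be handed directly to a gradient-based solver.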

Contrast: the traditional combinatorial program

Program (4): Traditional combinatorial score-based learning (NOTEARS §2.2)

$\min_{G} Q (G) \quad \text{subject to} \quad G \in D,$
where $Q : D \to R$ is a discrete score (BDe(u), BGe, BIC, MDL). Program (4) is NP-hard ([Chickering, 1996; Chickering et al., 2004]), mainly because the acyclicity constraint is nonconvex and combinatorial: the number of acyclic structures on $d$ nodes grows superexponentially in $d$ ([Robinson, 1977]).
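The superexponential growth can be checked numerically. The sketch below counts labeled DAGs with the standard recurrence usually attributed to Robinson, $a_{d} = \sum_{k=1}^{d} (-1)^{k+1} \binom{d}{k} 2^{k (d - k)} a_{d-k}$ with $a_{0} = 1$. The recurrence is not stated in this note and is brought in here as an assumption for illustration; the function name `count_dags` is hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(d):
    """Number of labeled DAGs on d nodes, via the inclusion-exclusion recurrence."""
    if d == 0:
        return 1
    total = 0
    for k in range(1, d + 1):
        # Choose k nodes with no incoming edges; each may point to any of the
        # remaining d - k nodes, and the rest must form a DAG themselves.
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(d, k) * 2 ** (k * (d - k)) * count_dags(d - k)
    return total

for d in range(1, 6):
    print(d, count_dags(d))
# 1 1
# 2 3
# 3 25
# 4 543
# 5 29281
```

Even at $d = 5$ there are already 29281 candidate structures, which gives a sense of why exhaustive search over $D$ stops being practical at small $d$.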

The essential distinction: program (4)'s domain is the discrete set $D$, whereas program (3)'s domain is the continuous $R^{d \times d}$.

Landscape of prior approaches

Camp	Idea	Limitation
Exact ([Cussens, 2012]; GOBNILP; [Chen et al., 2016])	Guaranteed globally optimal	Only a few dozen nodes; intractable in general
Local / approximate search (FGS, GES, hill-climbing, MMHC)	Add edges/parents one node at a time, check acyclicity incrementally	Needs bounded in-degree/treewidth — impossible to verify; real networks are scale-free with hub nodes
Order search ([Teyssier & Koller, 2005])	Search over $d!$ topological orderings	Trades acyclicity for an exponential ordering search
Constraint-based (PC, [Spirtes & Glymour, 1991])	Conditional-independence tests	Different paradigm; often less accurate
Hybrid / Bayesian (MMHC; [Zhou, 2011])	Combine the above	Conceptual complexity

The "conceptual clarity" gap NOTEARS targets

A recurring drawback the authors emphasize: prior methods are conceptually complex — they require deep graphical-model knowledge and clever tricks to accelerate. NOTEARS needs none: “implementable in just a few lines of code using existing black-box solvers.”

Connections

Undirected analogy: undirected (Markov network) structure learning is a convex log-det program ([Banerjee et al., 2008]); the directed case resisted such treatment until NOTEARS.
Local vs. global: NOTEARS updates the entire $W$ at each step (global), unlike edge-at-a-time local search.
SEM grounding: see Confirmatory Factor Analysis and SEM for structural equation models in the Bayesian setting; the weighted adjacency matrix here is the SEM coefficient matrix.
