NOTEARS Experiments
Summary
Empirical evaluation of NOTEARS against state-of-the-art baselines, chiefly Fast Greedy Search (FGS) ([Ramsey et al., 2016]). Across random graph models (Erdős–Rényi, scale-free), noise types (Gaussian, Exponential, Gumbel), node counts and sample sizes , NOTEARS matches FGS on sparse graphs and decisively outperforms it as graphs get denser (SF-4) and larger. NOTEARS’s score is also close to the exact global minimizer (GOBNILP), indicating nonconvexity is a minor issue in practice.
Overview
The paper’s statistical-consistency questions are settled by prior work; these experiments target the computational/structural-recovery question: does the continuous relaxation actually recover good DAGs? Baselines: FGS (the fast greedy-search implementation, chosen as the strongest scalable baseline), plus GES, PC, and LiNGAM (PC and LiNGAM were significantly weaker and only reported in the supplement).
Main Content
Experimental design
Simulation setup (NOTEARS §5, App. D)
- Graph models: Erdős–Rényi (ER) and scale-free (SF), with expected edges (ER-2 / SF-4 denote ). SF graphs have hub nodes — a known hard case for local-search methods.
- Weights: uniformly random edge weights assigned to a random graph to form .
- SEM sampling: with three noise models — Gaussian (
Gauss), Exponential (Exp), Gumbel (Gumbel).- Sizes: , ; rows i.i.d.; 10 simulations per cell.
- Metrics: Structural Hamming Distance (SHD, lower better) and False Discovery Rate (FDR).
- Note: FGS outputs a CPDAG, so comparisons require care (App. D.1).
Parameter estimation (without thresholding)
NOTEARS run without the thresholding step () — i.e. just solving the ECP — already produces empirically consistent estimates of the true weight matrix on both ER and SF graphs (Figures 1, 2). Takeaways:
- With large (), weights are estimated very well even without -regularization.
- With small (), -regularization is essential and keeps estimates accurate.
- The final thresholding step is needed only to sharpen structure learning, not parameter accuracy.
Structure learning (SHD / FDR)
Result: NOTEARS vs FGS on structure recovery (NOTEARS §5.2, Fig. 3)
Setup. SHD and FDR to the true graph, across ER-2 / SF-4 graphs, three noise types, up to 100, for and . Methods: FGS, NOTEARS, NOTEARS-L1 (-regularized).
Findings.
- Sparse graphs (ER-2): FGS is very competitive — comparable SHD to NOTEARS.
- Denser graphs (SF-4): FGS rapidly deteriorates as edge count grows; NOTEARS shows large, consistent improvements, and the gap widens as increases.
- Uniform across noise: NOTEARS performs uniformly better for every noise model (Exp, Gauss, Gumbel) without using any knowledge of the noise type.
- Small : the -regularizer (NOTEARS-L1) helps significantly.
Interpretation. Global updates of the full matrix let NOTEARS handle high-in-degree / hub-heavy graphs (SF) — precisely where local edge-at-a-time search struggles.
Comparison to the exact global minimizer
Result: NOTEARS vs GOBNILP global optimum (NOTEARS §5.3, Table 1)
Setup. Using GOBNILP ([Cussens, 2012]) — which enumerates all parent sets, limited to small DAGs () — compute the exact global minimizer of program (3). Compare NOTEARS’s score and parameters to the global optimum, over 10 datasets, .
Findings (Table 1, selected rows).
Graph 1000 0 ER2 5.02 4.97 0.02 1000 0 SF4 5.05 4.94 0.04 20 0 ER2 5.36 3.85 0.07 20 0 SF4 4.70 3.77 0.08 Interpretation. Although NOTEARS only guarantees a stationary point, in many cases the obtained solution is very close to the global minimizer (tiny ), especially at large . Encouraging evidence that the nonconvexity of (9) is a minor issue in practice for ER and SF graphs (worst-case graphs may still be hard).
Real data
Result: Sachs protein-signaling network (NOTEARS §5.4)
Dataset. [Sachs et al., 2005] — continuous measurements of protein/phospholipid expression in human immune-system cells (, , 20 edges), a standard graphical-models benchmark with a known consensus network (experimentally validated gold standard).
Result. FGS estimated 17 total edges with SHD = 22; NOTEARS estimated 16 edges with SHD = 22 — competitive with the established baseline on real data.
Discussion: limitations & future work
Limitations the authors flag (NOTEARS §6)
- Nonconvexity. Program (9) is still nonconvex; black-box solvers find stationary points. (Note: even GES only finds the global minimizer in the limit under assumptions, not for finite .)
- Smoothness requirement. The method relies on a smooth score to use gradient-based solvers. Non-smooth / discrete scores (e.g. BDe) would need techniques like Nesterov smoothing — left to future work.
- cost. The matrix exponential is cubic in the number of nodes (small for sparse matrices); this motivates second-order methods to cut the number of evaluations. No worst-case iteration-complexity bound is established, though typically only iterations occur.
- Fixed threshold . A data-driven, adaptive choice of (vs. the fixed suboptimal value used) would be preferable across different noise-to-signal ratios and graph types.
The headline strength
NOTEARS’s main advantage is smooth, global search delegated to standard numerical solvers — as opposed to combinatorial, local search. It already outperforms existing methods when the in-degree is large, a known hard spot for prior approaches.
Connections
- Validates the design choices in NOTEARS Algorithm (global updates, thresholding, -regularization in small ).
- Empirically supports the practical relevance of Smooth Characterization of Acyclicity — the continuous relaxation recovers true DAGs.
- Benchmark family: the Sachs dataset and ER/SF/noise simulation protocol became standard for the later differentiable-DAG-learning literature.
See Also
- NOTEARS Algorithm — the method being evaluated
- NOTEARS - Overview — paper-level context
- DAG Structure Learning Problem — baselines (FGS, GES, PC, LiNGAM) situated in the landscape