Summary Causal DAGs: Definition, Node Contraction, and NP-Hardness
Summary
A summary causal DAG is a smaller DAG obtained from a causal DAG by grouping (contracting) nodes into clusters, producing a graph with fewer nodes while preserving causal inference utility. The paper formally defines summary DAGs, identifies three constraints they must satisfy (size, CI preservation, no spurious dependencies), and proves that finding the optimal summary DAG is NP-hard. The fundamental operation — node contraction — is equivalent to adding edges to the input DAG.
Overview
High-dimensional causal DAGs are cognitively overwhelming and error-prone to specify. Summary causal DAGs provide a principled way to reduce complexity: by merging semantically related nodes, the summary DAG becomes smaller and easier to use for inference while retaining the key causal information.
Main Content
Background: Causal DAGs and the RB
A causal DAG is a Bayesian network — a DAG in which:
- Nodes are random variables.
- Edges represent direct causal influence.
- Joint distribution: , where are the parents of .
A causal DAG is a potential cause of iff there is a directed path .
d-Separation (Background)
Two sets of nodes are d-separated given in DAG if all paths connecting them are blocked by (via specific rules about active trails). -separation implies conditional independence: .
Recursive Basis (RB)
For a causal DAG , the Recursive Basis is a set of CIs (conditional independence statements) such that every CI encoded by can be derived from using the semi-graphoid axioms.
For a DAG , the RB says: for each node , is independent of its non-descendants given its parents. This is sound and complete for d-separation.
The RB is the key object to preserve in summarization — a summary DAG that faithfully represents the RB of the original DAG preserves all its conditional independence structure.
3.1 Summary Causal DAGs
Summary Causal DAG (Definition 1)
A summary causal DAG for a causal DAG is a pair where:
- is a DAG with (strictly fewer nodes)
- is a function that partitions the nodes of among the nodes of
is obtained by applying node contraction: given nodes , contracting them produces a single node that is now a neighbor of all former neighbors of and . The edge between and is removed.
We omit when it is clear from context.
Compatibility (Definition 2)
A causal DAG is compatible with a summary DAG if there exists a function that partitions nodes among nodes such that:
- or
- and (topological order preserved within clusters)
We denote the set of all causal DAGs compatible with as .
Intuition: A summary DAG represents a set of possible worlds — all causal DAGs compatible with it. The CIs encoded by the summary are the intersection of CIs that hold across all compatible DAGs: those that are certainly present, regardless of the fine-grained structure within clusters.
Summarization Constraints
A valid summary causal DAG must satisfy:
| Constraint | Description |
|---|---|
| Size | $ |
| CI Preservation | Causal dependencies in the original DAG must be faithfully preserved (directed edges in should still hold in ) |
| No spurious dependencies | Summary should not introduce conditional dependencies that the original DAG does not imply |
The RB and missing edges: In causal DAGs, information is encoded by missing edges (absent edges imply CI). Removing edges can undermine the causal model (incorrectly implies CI). Adding edges indicates only potential causal dependence (does not necessarily compromise validity if acyclicity is maintained). Therefore: summarization = adding edges (grouping nodes that were connected = merging paths into single edges).
3.2 Causal DAG Summarization Problem
Causal DAG Summarization Problem
Input: A causal DAG and a bound . Goal: Find a summary causal DAG with such that:
- preserves causal dependencies of
- preserves the CIs represented in (to the greatest extent possible)
- does not introduce spurious conditional dependencies
Objective: Minimize , the number of edges added to the canonical causal DAG (a proxy for information loss).
Theorem 3.2 — NP-Hardness of Optimal Summarization
Finding a summary DAG whose canonical causal DAG results in the smallest number of added edges is NP-hard.
Specifically, finding where for some threshold is NP-complete.
Proof intuition: The problem reduces to finding an optimal graph partition that minimizes edge additions when nodes are merged — a variant of the graph -bisection problem known to be NP-hard.
This NP-hardness motivates the greedy approximation in CaGReS Algorithm.
Examples
Example: REDSHIFT Causal DAG (12 nodes, 23 edges)
The REDSHIFT cloud monitoring DAG tracks query execution performance: nodes include
Query Template,Returned Rows,Ret. Bytes,Num Tables,Compile Time,Plan Time,Exec. Time,Lock Wait Time,Elapsed Time, and others.A summary DAG with groups semantically related variables (e.g., compile/plan/exec time → query execution cluster). The 5-node summary in Fig. 2b has 9 edges vs. the 23 in the original — dramatically reducing complexity while preserving the key paths from query characteristics to elapsed time.
Connections
- Builds directly on Directed Acyclic Graphs — uses d-separation, do-calculus, and the Recursive Basis.
- The CI preservation goal connects to Frequentist Causal Estimation — adjustment sets for ATE estimation depend on d-separation, which must be preserved in the summary.
- The NP-hardness motivates the greedy CaGReS Algorithm.
See Also
- Canonical Causal DAGs — the canonical DAG and node-contraction-as-edge-addition
- CaGReS Algorithm — greedy solution to the NP-hard problem
- s-Separation in Summary DAGs — CI identification in summary DAGs
- Do-Calculus in Summary Causal DAGs — do-calculus identifiability in the summarized graph
- Zeng 2025 - Overview — paper overview
- Frequentist Causal Estimation — the adjustment set framework that CI preservation in summary DAGs must protect
- Bayesian Outcome Models — Bayesian causal estimation that relies on DAG structure for confounding adjustment
- Causal Discovery - Overview — structure learning precedes DAG summarization