Summary Causal DAGs: Definition, Node Contraction, and NP-Hardness

Summary

A summary causal DAG is a smaller DAG obtained from a causal DAG by grouping (contracting) nodes into clusters, producing a graph with fewer nodes while preserving causal inference utility. The paper formally defines summary DAGs, identifies three constraints they must satisfy (size, CI preservation, no spurious dependencies), and proves that finding the optimal summary DAG is NP-hard. The fundamental operation — node contraction — is equivalent to adding edges to the input DAG.

Overview

High-dimensional causal DAGs are cognitively overwhelming and error-prone to specify. Summary causal DAGs provide a principled way to reduce complexity: by merging semantically related nodes, the summary DAG becomes smaller and easier to use for inference while retaining the key causal information.

Main Content

Background: Causal DAGs and the RB

A causal DAG $G$ is a Bayesian network — a DAG $G$ in which:

Nodes $X_{i}$ are random variables.
Edges $X_{i} \to X_{j}$ represent direct causal influence.
Joint distribution: $P (X_{1}, \dots, X_{n}) = \prod_{i} P (X_{i} ∣ pa (X_{i}))$ , where $pa (X_{i})$ are the parents of $X_{i}$ .
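The factorization above can be illustrated with a minimal sketch. The two-node network (`Rain` causing `WetGrass`), its probabilities, and the helper name `joint_probability` are all made up for illustration and do not come from the source.

```python
# Hypothetical network Rain -> WetGrass with made-up probabilities.
parents = {"Rain": [], "WetGrass": ["Rain"]}

# cpt[node][parent_values] is P(node = True | parents = parent_values).
cpt = {
    "Rain": {(): 0.2},
    "WetGrass": {(True,): 0.9, (False,): 0.1},
}

def joint_probability(assignment):
    """Multiply P(X_i | pa(X_i)) over all nodes for one full assignment."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p_true = cpt[node][pa_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

# P(Rain = True, WetGrass = True) = P(Rain) * P(WetGrass | Rain)
p_both = joint_probability({"Rain": True, "WetGrass": True})
```

Summing `joint_probability` over all four assignments gives 1, which is a quick sanity check that the conditional tables define a valid joint distribution.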

A node $X$ is a potential cause of a node $Y$ iff there is a directed path $X \to \dots \to Y$ in $G$ .
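The directed-path test behind "potential cause" can be sketched as a breadth-first search. The edge-list representation and the function name `has_directed_path` are assumptions made for this illustration.

```python
from collections import deque

def has_directed_path(edges, source, target):
    """Return True if a directed path with at least one edge leads from source to target."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    queue = deque(children.get(source, []))
    seen = set()
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        queue.extend(children.get(node, []))
    return False

# Hypothetical toy DAG: A -> B -> C and A -> D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "D")]
```

With this toy DAG, `has_directed_path(toy_edges, "A", "C")` holds, while the reverse query and the query from `D` to `C` do not.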

d-Separation (Background)

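Two sets of nodes $X, Y$ are d-separated given $Z$ in DAG $G$ if every path connecting them is blocked by $Z$ . A path is blocked when it contains a chain or fork whose middle node is in $Z$ , or a collider such that neither the collider nor any of its descendants is in $Z$ . d-separation implies conditional independence: $X \perp\!\!\!\perp Y ∣ Z$ .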
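One way to check d-separation in code is the moralized ancestral graph criterion, which is a standard equivalent of the path-blocking definition rather than something taken from the source. The sketch below assumes edges are given as (parent, child) pairs; the function name `d_separated` is hypothetical.

```python
from itertools import combinations

def d_separated(edges, xs, ys, zs):
    """Test whether xs and ys are d-separated given zs in the DAG defined by edges."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
        parents.setdefault(u, set())

    # Step 1: keep only xs, ys, zs and all of their ancestors.
    keep, stack = set(), list(xs | ys | zs)
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents.get(n, ()))

    # Step 2: moralize (link co-parents) and drop edge directions.
    nbrs = {n: set() for n in keep}
    for child in keep:
        pa = parents.get(child, set())
        for p in pa:
            nbrs[p].add(child)
            nbrs[child].add(p)
        for p, q in combinations(pa, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)

    # Step 3: remove zs and test whether any x can still reach any y.
    seen, stack = set(), list(xs - zs)
    while stack:
        n = stack.pop()
        if n in seen or n in zs:
            continue
        seen.add(n)
        stack.extend(nbrs[n] - zs)
    return not (seen & ys)

chain = [("A", "B"), ("B", "C")]        # A -> B -> C
collider = [("A", "C"), ("B", "C")]     # A -> C <- B
```

The two toy graphs show the characteristic behavior: conditioning on the middle node blocks the chain, whereas conditioning on the collider opens the path between its parents.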

Recursive Basis (RB)

For a causal DAG $G$ , the Recursive Basis $RB (G)$ is a set of CIs (conditional independence statements) such that every CI encoded by $G$ can be derived from $RB (G)$ using the semi-graphoid axioms.

Concretely, $RB (G)$ contains one statement per node $X_{i}$ : $X_{i}$ is independent of its non-descendants given its parents. This set is sound and complete for d-separation.

The RB is the key object to preserve in summarization — a summary DAG that faithfully represents the RB of the original DAG preserves all its conditional independence structure.
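Following the per-node reading of the RB given above (each node is independent of its non-descendants given its parents), the statements can be enumerated with a short sketch. The triple format `(node, independent_of, given)` and the names `descendants` and `recursive_basis` are assumptions for illustration; statements with an empty "independent of" set are skipped because they carry no information.

```python
def descendants(children, node):
    """Collect every node reachable from node along directed edges."""
    found, stack = set(), list(children[node])
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(children[n])
    return found

def recursive_basis(nodes, edges):
    """List (X, non-descendants of X excluding parents, parents of X) per node."""
    children = {n: set() for n in nodes}
    parents = {n: set() for n in nodes}
    for u, v in edges:
        children[u].add(v)
        parents[v].add(u)
    basis = []
    for n in nodes:
        others = set(nodes) - descendants(children, n) - parents[n] - {n}
        if others:
            basis.append((n, frozenset(others), frozenset(parents[n])))
    return basis

# Chain A -> B -> C: the only non-trivial statement is C independent of A given B.
chain_basis = recursive_basis(["A", "B", "C"], [("A", "B"), ("B", "C")])
```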

3.1 Summary Causal DAGs

Summary Causal DAG (Definition 1)

A summary causal DAG for a causal DAG $G$ is a pair $(H, f)$ where:

$H$ is a DAG with $∣ V (H) ∣ < ∣ V (G) ∣$ (strictly fewer nodes)

$f : V (G) \to V (H)$ is a function that partitions the nodes of $G$ among the nodes of $H$

$H$ is obtained by applying node contraction: given nodes $U, V \in V (G)$ , contracting them produces a single node $C = f (U) = f (V)$ in $H$ that is a neighbor of all former neighbors of $U$ and $V$ . Any edge between $U$ and $V$ is removed.

We omit $f$ when it is clear from context.
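The node contraction step in Definition 1 can be sketched as follows. The edge-set representation and the helper names `contract` and `is_acyclic` are assumptions for this illustration. The acyclicity check is included because $H$ must be a DAG, and not every pair of nodes can be contracted without creating a cycle.

```python
def contract(edges, u, v, merged):
    """Contract u and v into one node named merged, dropping any u-v edge."""
    def rename(n):
        return merged if n in (u, v) else n
    new_edges = set()
    for a, b in edges:
        a2, b2 = rename(a), rename(b)
        if a2 != b2:  # an edge between u and v would become a self-loop
            new_edges.add((a2, b2))
    return new_edges

def is_acyclic(edges):
    """Kahn's algorithm: a graph is a DAG iff every node can be removed in order."""
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    for _, b in edges:
        indeg[b] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for a, b in edges:
            if a == n:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return removed == len(nodes)

# Hypothetical toy DAG: A -> B -> C -> D plus the shortcut A -> C.
g = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
good = contract(g, "B", "C", "BC")   # stays acyclic
bad = contract(g, "A", "C", "AC")    # creates AC -> B -> AC
```

Contracting `B` and `C` leaves a valid two-edge DAG, while contracting `A` and `C` folds the path through `B` into a cycle, so that grouping is not a valid summary.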

Compatibility (Definition 2)

A causal DAG $G$ is compatible with a summary DAG $H$ if there exists a function $f$ that partitions nodes $V (G)$ among nodes $V (H)$ such that:

$(U, V) \in E (G) \Rightarrow (f (U), f (V)) \in E (H)$ or $f (U) = f (V)$

$f (X_{i}) = f (X_{j})$ and $i < j$ (topological order preserved within clusters)

We denote the set of all causal DAGs compatible with $H$ as $\{G_{i}\}_{H}$ .

Intuition: A summary DAG $H$ represents a set of possible worlds — all causal DAGs compatible with it. The CIs encoded by the summary are the intersection of CIs that hold across all compatible DAGs: those that are certainly present, regardless of the fine-grained structure within clusters.

Summarization Constraints

A valid summary causal DAG must satisfy:

Constraint	Description
Size	$∣ V (H) ∣ < ∣ V (G) ∣$ : the summary has strictly fewer nodes than the original DAG (Definition 1)
CI Preservation	Causal dependencies in the original DAG must be faithfully preserved (directed edges $A \to B$ in $G$ should still hold in $H$ )
No spurious dependencies	Summary should not introduce conditional dependencies that the original DAG does not imply

The RB and missing edges: In a causal DAG, information is encoded by missing edges: an absent edge implies a CI. Removing an edge therefore asserts a CI that the original model does not support, which can undermine the causal model. Adding an edge indicates only potential causal dependence, so it does not compromise validity as long as acyclicity is maintained. Summarization can therefore be viewed as adding edges to the input DAG, since a contracted node inherits the neighbors of every node it absorbs.

3.2 Causal DAG Summarization Problem

Causal DAG Summarization Problem

Input: A causal DAG $G$ and a bound $k$ . Goal: Find a summary causal DAG $(H, f)$ with $∣ V (H) ∣ = k$ such that:

$H$ preserves causal dependencies of $G$

$H$ preserves the CIs represented in $G$ (to the greatest extent possible)

$H$ does not introduce spurious conditional dependencies

Objective: Minimize $∣ E (G_{H}) ∣ - ∣ E (G) ∣$ , the number of edges added to the canonical causal DAG $G_{H}$ (a proxy for information loss).
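The objective can be made concrete with a small sketch. The canonical causal DAG $G_{H}$ is not defined in this note, so the code relies on a loudly stated assumption: $G_{H}$ is taken to be the expansion of $H$ in which nodes inside a cluster are fully connected along a topological order of $G$ , and every edge between two clusters in $H$ becomes all edges from members of the first cluster to members of the second. The names `canonical_edges` and `added_edges` are hypothetical.

```python
def canonical_edges(g_edges, order, f):
    """Expand a clustering f into the edge set of the ASSUMED canonical DAG.

    order is a topological order of G's nodes; f maps each node to its cluster.
    """
    pos = {n: i for i, n in enumerate(order)}
    # Edges of the summary DAG H induced by the clustering.
    h_edges = {(f[a], f[b]) for a, b in g_edges if f[a] != f[b]}
    canon = set()
    for a in order:
        for b in order:
            if a == b:
                continue
            if f[a] == f[b]:
                if pos[a] < pos[b]:  # fully connect the cluster, respecting order
                    canon.add((a, b))
            elif (f[a], f[b]) in h_edges:  # expand each H edge to all member pairs
                canon.add((a, b))
    return canon

def added_edges(g_edges, order, f):
    """Objective value |E(G_H)| - |E(G)| under the assumption above."""
    return len(canonical_edges(g_edges, order, f)) - len(set(g_edges))

# Hypothetical chain A -> B -> C -> D summarized into two clusters (k = 2).
g_edges = [("A", "B"), ("B", "C"), ("C", "D")]
order = ["A", "B", "C", "D"]
f = {"A": "AB", "B": "AB", "C": "CD", "D": "CD"}
cost = added_edges(g_edges, order, f)
```

In this toy case the expansion contains the three original edges plus `A -> C`, `A -> D`, and `B -> D`, so the clustering costs three added edges. The sketch does not check that the induced $H$ is acyclic.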

Theorem 3.2 — NP-Hardness of Optimal Summarization

Finding a summary DAG $(H, f)$ whose canonical causal DAG $G_{H}$ results in the smallest number of added edges $∣ E (G_{H}) ∣ - ∣ E (G) ∣$ is NP-hard.

Specifically, finding $(H, f)$ where $∣ E (G_{H}) ∣ - ∣ E (G) ∣ \leq τ$ for some threshold $τ \geq 0$ is NP-complete.

Proof intuition: Choosing which nodes to merge so that the fewest edges are added amounts to finding an optimal graph partition, a variant of the graph $k$-bisection problem, which is known to be NP-hard.

This NP-hardness motivates the greedy approximation used by the CaGReS Algorithm.

Examples

Example: REDSHIFT Causal DAG (12 nodes, 23 edges)

The REDSHIFT cloud monitoring DAG tracks query execution performance: nodes include Query Template, Returned Rows, Ret. Bytes, Num Tables, Compile Time, Plan Time, Exec. Time, Lock Wait Time, Elapsed Time, and others.

A summary DAG with $k = 5$ groups semantically related variables (e.g., compile/plan/exec time → query execution cluster). The 5-node summary in Fig. 2b has 9 edges vs. the 23 in the original — dramatically reducing complexity while preserving the key paths from query characteristics to elapsed time.

Connections

Builds directly on Directed Acyclic Graphs — uses d-separation, do-calculus, and the Recursive Basis.
The CI preservation goal connects to Frequentist Causal Estimation — adjustment sets for ATE estimation depend on d-separation, which must be preserved in the summary.
The NP-hardness motivates the greedy CaGReS Algorithm.
