CaGReS Algorithm: Greedy Causal DAG Summarization

Summary

CaGReS (Causal Greedy Summarization) is a bottom-up greedy algorithm for causal DAG summarization. In each iteration it finds the pair of nodes whose contraction adds the fewest edges to the canonical causal DAG (measured by GetCost), and merges them — until the size constraint is met. Four optimizations improve runtime: semantic similarity constraint (only semantically related nodes are merged), caching (invalid pairs and scores), low-cost merges (pre-contract non-branching chains), and time complexity of .

Overview

Since optimal causal DAG summarization is NP-hard (Theorem 3.2 in Summary Causal DAGs), CaGReS uses a greedy approximation. The key insight from Canonical Causal DAGs (Theorem 4.1) is that contraction costs can be measured precisely by counting the edges added to the canonical DAG — enabling a well-defined greedy objective.

Main Content

Algorithm 1: CaGReS

CaGReS Algorithm

Input: Causal DAG , node count bound . Output: Summary causal DAG with nodes.

1.  H ← G
2.  H ← LowCostMerges(H)        # pre-contract cheap pairs
3.  while size(H.nodes) > k do
4.      min_cost ← ∞
5.      (X, Y) ← Null
6.      for (U, V) ∈ H.nodes do
7.          if IsValidPair(U, V, H) then
8.              cost_UV ← GetCost(U, V, H)
9.              if cost_UV < min_cost then
10.                 min_cost ← cost_UV
11.                 (X, Y) ← (U, V)
12.             if cost_UV == min_cost then
13.                 randomly decide whether to replace (X, Y)
14.     H.Merge(X, Y)
15. return H

IsValidPair: A pair is valid if neither nor are neighbors of each other in — merging neighboring nodes could create self-loops (violates DAG acyclicity).

GetCost: Computes the number of edges added to the canonical DAG when contracting and . See Algorithm 2 below.

Random tie-breaking (line 13): When multiple pairs achieve the minimum cost, CaGReS randomly selects among them. This can be determinized with a fixed random seed for reproducibility.

Algorithm 2: GetCost Procedure

GetCost Procedure

Input: Summary causal DAG , node pair and . Output: Cost of contracting and .

The cost equals the number of edges to be added in the canonical causal DAG when merging and :

  1. Within-cluster edges (line 3–4): If .HasEdge, cost += . These are new within-cluster edges connecting the two groups.

  2. New parents (lines 6–11): Let and be the predecessors of and .

    • : parents of that are not parents of post-merge.
    • cost +=
    • Similarly for parentsOnlyV and size.
  3. New children (lines 12–17): Symmetrically for successors:

    • cost +=
    • Similarly for childrenOnlyV.
  4. Return total cost.

Intuition: Merging and means every node in ‘s cluster must now be connected to every node in ‘s cluster (and vice versa for parents/children). The cost counts exactly how many new edges this requires in the canonical DAG.

Four Optimizations

CaGReS Optimizations

1. Semantic Constraint: Only semantically related node pairs are considered for contraction. A semantic similarity measure assigns values in to variable pairs; a summary DAG satisfies the semantic constraint if for every cluster , for all . This reduces the search space and ensures that only meaningfully related variables are merged.

2. Caching: Two cache mechanisms — one for invalid pairs (pairs that would violate acyclicity, checked via IsValidPair; cost is unchanged after merging a different node pair), and one for cost scores (GetCost results; a contraction of only changes the cost of pairs involving or ‘s neighbors). Caches are updated incrementally after each merge.

3. Low-Cost Merges (LowCostMerges): As pre-processing, contract node pairs that share identical children and parents (non-branching paths). These have cost 0 and are always optimal to merge first. Additionally, merge nodes linked along non-branching paths with at most one parent and one child. This is experimentally shown to be effective for small-to-medium density DAGs.

4. Time Complexity: A single cost computation for pair takes (checking at most neighbors). The algorithm runs iterations, evaluating pairs each time (with at most neighbors per node). Overall: .

Algorithm Properties

  • Acyclicity preservation: CaGReS only contracts non-neighboring pairs (IsValidPair check). The contracted result maintains DAG acyclicity because no self-loops are created.
  • CI preservation: By Theorem 4.1, the cost function (edge additions to canonical DAG) correctly measures CI information loss — minimizing cost = maximizing preserved CIs.
  • No theoretical guarantee: CaGReS is a heuristic; it does not guarantee the optimal solution. But experimentally it produces summary DAGs that are very close to Brute-Force (optimal) on small graphs, and outperforms all other practical baselines on large graphs.

Examples

FLIGHTS Dataset (21 nodes → 10 nodes)

CaGReS groups semantically similar query metrics: e.g., Num Tables, Num Joins, Num Columns → one cluster (all query structural features). The resulting 10-node summary has fewer additional edges in the canonical DAG than k-Snap’s summary, and 83.3% of its CIs are implied by Brute-Force (vs. only 50% for k-Snap).

Connections

  • CaGReS implements the optimization problem defined in Summary Causal DAGs using the cost function derived from Canonical Causal DAGs.
  • The semantic constraint makes CaGReS appropriate for high-dimensional data where domain knowledge is available — analogous to informative priors in Bayesian Inference.
  • The complexity makes CaGReS the only algorithm (besides k-Snap) that handles DAGs with nodes in practice.

See Also