LLM Causal Reasoning Tasks: Abductive NLG and Counterfactual Reasoning
Summary
Liu et al. study two unsupervised causal reasoning tasks: Abductive NLG (generate a plausible connecting hypothesis given premise and ending) and Counterfactual Reasoning (minimally edit an ending to accommodate a counterfactual event). Both are zero-shot generation problems where models must exploit internal causal structure without task-specific training. The causal relationships in both tasks form DAGs that if/elif code statements can represent directly.
Overview
The paper treats both tasks as unsupervised zero-shot learning: models receive only a task description and must generate outputs using pre-trained knowledge, without labeled causal examples. This tests the causal reasoning capabilities already present in (or elicitable from) the model.
Main Content
3.1 Abductive Reasoning Task
Abductive Reasoning (αNLG formulation)
Given a premise $O_P$ and an ending $O_E$ (observable states), generate a plausible hypothesis $H$ that explains how the premise could lead to the ending.
Formally, models must maximize $P(H \mid O_P, O_E)$.
The chronological ordering is $O_P \to H \to O_E$: the premise happens first, and the hypothesis connects it to the ending. The task is non-monotonic: the ending $O_E$ constrains the hypothesis even though it occurs after $H$ in time.
Key challenge: Non-monotonic reasoning. The model must consider not just the premise but also the future context $O_E$ when generating $H$. Simple left-to-right prediction fails because the ending must be consistent with the hypothesis.
Dataset — αNLG:
- 3,561 test instances from ROCStories (5-sentence crowd-sourced stories)
- Premise = first sentence; Ending = last sentence
- 4.02 plausible hypotheses annotated per instance (crowd-sourced)
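To make the input format concrete, the sketch below packs one αNLG-style instance into a small data structure and renders a zero-shot prompt that shows both observations before the slot for the hypothesis. The class name, field names, prompt wording, and the example story are all illustrative assumptions; none of them is taken from the paper or the dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AbductiveInstance:
    """One abductive example (hypothetical field names)."""
    premise: str  # O_P: first sentence of the story
    ending: str   # O_E: last sentence of the story
    references: List[str] = field(default_factory=list)  # annotated hypotheses


def build_zero_shot_prompt(inst: AbductiveInstance) -> str:
    """Show both observations before the open slot for H.

    H sits between O_P and O_E in story time, but the ending is placed
    up front so that generation can condition on it.
    """
    return (
        "Task: write one sentence describing what happened in between.\n"
        f"Beginning: {inst.premise}\n"
        f"Ending: {inst.ending}\n"
        "Middle:"
    )


# Made-up story, used only to exercise the function.
example = AbductiveInstance(
    premise="Maya planted tomato seeds in spring.",
    ending="She shared a basket of tomatoes with her neighbors.",
)
prompt = build_zero_shot_prompt(example)
print(prompt)
```

Placing the ending before the open slot is one simple way to let a left-to-right model see the future context; other layouts are possible.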
3.2 Counterfactual Reasoning Task
Counterfactual Reasoning (TimeTravel formulation)
Given a story with premise $P$, initial context $C$, original ending $E$, and a counterfactual event $C'$ (which contradicts $C$), generate a new ending $E'$ that:
- Maximally preserves $E$ (minimal edits)
- Is coherent with the counterfactual context $C'$
Formally, the model maximizes an objective that balances similarity to the original ending $E$ against coherence with the counterfactual event $C'$.
Key challenge: The model must both (a) understand the causal relationships driving the narrative and (b) surgically edit the original ending $E$ to accommodate the counterfactual event $C'$ while preserving unaffected parts. This requires distinguishing core causal chains from spurious correlations.
Dataset — TimeTravel:
- 1,871 test instances, also from ROCStories
- 4-part input: premise $P$, initial context $C$, original ending $E$ (3 sentences), counterfactual event $C'$
- 3 annotated counterfactual endings per instance
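The "minimal edits" requirement can be illustrated with a rough token-level similarity between the original and rewritten endings. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in proxy; it is not the metric used in the paper, and the function name and example sentences are made up for illustration.

```python
from difflib import SequenceMatcher


def preservation_ratio(original_ending: str, new_ending: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means the ending is unchanged."""
    a = original_ending.lower().split()
    b = new_ending.lower().split()
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical example: the counterfactual swaps a dog for a cat.
original = "He took his dog to the park. They played fetch. The dog slept well."
minimal_edit = "He took his cat to the park. They played fetch. The cat slept well."
full_rewrite = "The cat ignored everyone and hid under the sofa all night."

print(preservation_ratio(original, minimal_edit))
print(preservation_ratio(original, full_rewrite))
```

A surgical edit should score much closer to 1.0 than a full rewrite. Similarity alone says nothing about coherence with the counterfactual event, which has to be judged separately.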
Causal Structure in Both Tasks
Both tasks share a common causal DAG structure:
Abductive: O_P → H → O_E
Counterfactual: P → C → E
                ↓
                C' → E'
This branching structure (alternative paths from premise) maps directly to if/elif conditional statements in code — which is the core insight exploited in Code Prompts for Causal Structure.
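As a rough illustration of that mapping, the sketch below renders the two branches P → C → E and P → C' → E' as an unfinished if/elif block whose last assignment is left open for a model to complete. The function name, prompt layout, and example sentences are hypothetical; the exact prompt format used in the paper is not reproduced here.

```python
def render_counterfactual_prompt(premise: str, context: str,
                                 ending: str, cf_context: str) -> str:
    """Render both causal branches as an unfinished if/elif code prompt.

    The layout and identifier names are assumptions for illustration.
    The final line is left open so a code model would fill in E'.
    """
    lines = [
        "def story():",
        f"    premise = {premise!r}",
        f"    if context == {context!r}:",
        f"        ending = {ending!r}",
        f"    elif context == {cf_context!r}:",
        "        ending =",
    ]
    return "\n".join(lines)


# Made-up story, used only to show the rendered prompt.
prompt = render_counterfactual_prompt(
    premise="Sam wanted a pet.",
    context="He adopted a dog.",
    ending="He walked it every morning.",
    cf_context="He adopted a goldfish.",
)
print(prompt)
```

The `if` branch carries the original context and ending, and the `elif` branch carries the counterfactual context, so the shared premise sits above both branches just as it does in the DAG.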
Evaluation Metrics
| Task | Metrics |
|---|---|
| Abductive | BLEU₄, ROUGE_L, CIDEr, BERTScore |
| Counterfactual | BLEU₄, ROUGE_L, BERTScore |
CIDEr is used for abductive reasoning because its TF-IDF weighting upweights rare n-grams, measuring whether the hypothesis captures specific causal details. BERTScore captures semantic similarity; BLEU and ROUGE capture lexical overlap.
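To show what lexical overlap means in practice, here is a simplified ROUGE-L sketch based on the longest common subsequence (LCS) of tokens. It assumes whitespace tokenization, a single reference, and a plain F1 combination of precision and recall, so its numbers will not match an official ROUGE implementation exactly.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat", "the dog sat"))
```

Because LCS only rewards tokens that appear in the same order, a paraphrase with different wording scores low here even when it is semantically correct, which is the gap BERTScore is meant to cover.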
Connections
- The counterfactual task connects to Potential Outcomes Framework: the endings $E$ vs. $E'$ are analogous to the potential outcomes under the interventions $C$ vs. $C'$.
- The non-monotonic nature of abductive reasoning resembles the backward-induction problem in Causal Estimands — estimating causes from effects.
- Both tasks’ causal graphs can be represented as DAGs — see Directed Acyclic Graphs and Summary Causal DAGs for formal treatment.
See Also
- Code Prompts for Causal Structure — how these task structures are encoded in code
- Liu 2025 - Overview — paper overview and results