NLP Causal Extraction Methods
Summary
Two NLP methods extract causal relationships from Japanese free-text input. Method A uses clue expressions and dependency parsing (high precision, ~46% recall). Method B decomposes sentences into simple sub-sentences and treats their order as the causal direction (higher recall, ~63%). Combined, they correctly handle 87 of the 100 sentences in the test corpus.
Overview
The system must automatically identify which part of a participant’s free-text input is the cause and which is the effect. It uses two complementary methods, both of which rely on dependency analysis by CaboCha, a Japanese dependency parser.
Method A — Clue Expression Extraction
Approach: Searches for linguistic clue expressions that signal causality (e.g., “node” (ので) in Japanese, roughly equivalent to “because/as” in English). When a clue expression is found, dependency analysis determines the sentence structure and identifies the cause and effect phrases.
Five sentence patterns are recognized (from Sakaji et al.). “Result phrase” and “effect phrase” are used interchangeably in the table:
| Pattern | Structure |
|---|---|
| A | Cause phrase → Clue phrase → Effect phrase |
| B | Subject of result phrase → Cause phrase → Clue phrase → Predicate of result phrase |
| C | Result phrase → Cause phrase → Clue phrase |
| D | Sentence 1: Result phrase / Sentence 2: Cause phrase → Clue phrase |
| E | Sentence 1: Cause phrase / Sentence 2: Clue phrase → Result phrase |
Limitation: Many Japanese causal sentences contain no explicit clue expression, and this method cannot extract them.
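As a rough illustration of Pattern A only, the sketch below splits a sentence at the first clue expression it finds. It is a simplification under stated assumptions: the clue list `CLUE_EXPRESSIONS` and the function `extract_pattern_a` are hypothetical names, and plain string search stands in for the CaboCha dependency analysis the method actually relies on.

```python
from typing import Optional, Tuple

# Hypothetical clue list for illustration; the real clue inventory is not given in this note.
CLUE_EXPRESSIONS = ("ので", "ため", "から")


def extract_pattern_a(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (cause, clue, effect) for Pattern A, or None if no clue expression is found.

    Assumption: text before the clue is the cause and text after it is the effect.
    The actual method uses dependency analysis to find phrase boundaries.
    """
    for clue in CLUE_EXPRESSIONS:
        idx = sentence.find(clue)
        if idx > 0:  # the clue must be preceded by a non-empty cause phrase
            cause = sentence[:idx]
            effect = sentence[idx + len(clue):].strip("、。 ")
            if effect:
                return cause, clue, effect
    return None


# "Because a blackout occurred, the train stopped."
result = extract_pattern_a("停電が起きたので、電車が止まった。")
# A sentence with no clue expression yields None, which is the limitation noted above.
no_clue = extract_pattern_a("電車が止まった。")
```

Patterns B to E need the dependency structure (for example, to separate the subject of the result phrase from its predicate), so they are outside what this string-based sketch can handle.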
Method B — Sentence Decomposition
Approach: Divides compound sentences into simple sub-sentences. In Japanese, earlier sub-sentences tend to describe causes and later sub-sentences tend to describe effects.
Algorithm (flowchart, Fig. 3 in paper):
- Run dependency analysis with CaboCha
- For each predicate, check if it has a noun clause modifier (NC = noun clause)
- If the modifier is a noun clause → divide the sentence at that point
- The first sub-sentence is assigned as “cause”; the sub-sentence that follows it is assigned as “effect”
- For 3+ sub-sentences: sentence 1 causes sentence 2, sentence 2 causes sentence 3 (chain)
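The chaining rule in the last two steps can be sketched as follows. This assumes the sentence has already been divided into ordered sub-sentences (the CaboCha-based division itself is not reproduced here); `chain_causal_pairs` is a hypothetical helper name.

```python
from typing import List, Tuple


def chain_causal_pairs(sub_sentences: List[str]) -> List[Tuple[str, str]]:
    """Pair each sub-sentence (cause) with the one that follows it (effect).

    Two sub-sentences give one pair; three give a chain of two pairs, and so on.
    A single sub-sentence gives no pairs.
    """
    return list(zip(sub_sentences, sub_sentences[1:]))


# Placeholder sub-sentences, already split in sentence order.
pairs = chain_causal_pairs(["s1", "s2", "s3"])
```

Because every adjacent pair is emitted regardless of meaning, purely sequential (non-causal) sub-sentences also come out as cause and effect.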
Trade-off: Higher recall than Method A (extracts more causalities) but also higher false positive rate (extracts some non-causal sequences as causal).
Combination Strategy
Both methods are applied independently. Their outputs are presented together to the participant in the GUI. The participant selects the correct causal pair (or enters one manually). The human confirmation step corrects false positives.
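A minimal sketch of how the two candidate lists might be merged before being shown to the participant. The function name `merge_candidates` is hypothetical, and the removal of identical pairs is an assumption for illustration; the note only states that both outputs are presented together.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (cause, effect)


def merge_candidates(from_a: Iterable[Pair], from_b: Iterable[Pair]) -> List[Pair]:
    """Combine candidates from Method A and Method B, keeping first-seen order.

    Assumption: a pair proposed by both methods is listed only once.
    """
    seen = set()
    merged: List[Pair] = []
    for pair in list(from_a) + list(from_b):
        if pair not in seen:
            seen.add(pair)
            merged.append(pair)
    return merged


candidates = merge_candidates(
    [("blackout occurs", "train stops")],
    [("blackout occurs", "train stops"), ("train stops", "people are late")],
)
```

The participant then picks the correct pair from `candidates` or enters one manually, so incorrect candidates do not reach the database.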
Verification experiment results (Kyoto University Web Document Leads Corpus, 100 sentences):
| Method | Correct identifications |
|---|---|
| Method A | 46 |
| Method B | 63 |
| Method A + B combined | 87 |
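The combined count exceeds either method alone, so each method recovers sentences the other misses, which suggests they have complementary strengths. Both methods failed on the remaining 13 sentences.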
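The overlap between the two methods follows from the table by inclusion-exclusion. The arithmetic below assumes that the combined figure counts sentences correctly handled by at least one method; the variable names are illustrative.

```python
# Counts taken from the table above (100-sentence test set).
correct_a = 46
correct_b = 63
correct_combined = 87  # assumed to be the union of the two correct sets

# Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
correct_by_both = correct_a + correct_b - correct_combined
only_a = correct_a - correct_by_both
only_b = correct_b - correct_by_both

print(correct_by_both, only_a, only_b)  # 22 24 41
```

Under that assumption, 24 sentences are recovered only by Method A and 41 only by Method B, which is where the benefit of combining them comes from.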
Deduplication — Word2Vec Similarity
To prevent the same event being stored multiple times with different phrasing (e.g., “blackout occurs” vs. “electricity stops”), new sentences are compared to existing database entries using Word2Vec vector similarity.
Procedure:
- Morphological analysis of sentence (using CaboCha)
- Compute weighted average of morpheme vectors (weighting verb, adjective, noun morphemes)
- Compare resulting vector against the database vector set using cosine distance
- Sentences within a fixed distance threshold are flagged as potential duplicates and shown to the participant
- If the participant confirms similarity → the existing entry is used; otherwise the new sentence is added to the database
Key detail: Only verb, adjective, and noun vectors are included (these carry the semantic content in Japanese); particles and other function words are excluded.
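The procedure can be sketched in pure Python as below. Everything concrete here is an assumption for illustration: the part-of-speech weights in `POS_WEIGHTS`, the toy two-dimensional vectors, the threshold value, and the function names. Morphological analysis is assumed to have been done already, so each sentence arrives as a list of (morpheme, part of speech) pairs.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical weights: only content words contribute; anything else gets weight 0.
POS_WEIGHTS = {"verb": 1.0, "adjective": 1.0, "noun": 1.0}


def sentence_vector(morphemes: Sequence[Tuple[str, str]],
                    embeddings: Dict[str, List[float]]) -> List[float]:
    """Weighted average of the vectors of verb, adjective, and noun morphemes."""
    summed: List[float] = []
    total_weight = 0.0
    for surface, pos in morphemes:
        weight = POS_WEIGHTS.get(pos, 0.0)
        vec = embeddings.get(surface)
        if weight == 0.0 or vec is None:
            continue  # skip particles, function words, and unknown morphemes
        if not summed:
            summed = [0.0] * len(vec)
        for i, x in enumerate(vec):
            summed[i] += weight * x
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no content morphemes with known vectors")
    return [x / total_weight for x in summed]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def find_duplicates(new_vec: Sequence[float],
                    database: Dict[str, List[float]],
                    threshold: float) -> List[str]:
    """Return database entries whose distance to new_vec is within the threshold."""
    return [text for text, vec in database.items()
            if cosine_distance(new_vec, vec) <= threshold]


# Toy embeddings standing in for a trained Word2Vec model.
toy = {"停電": [1.0, 0.0], "起きる": [0.0, 1.0], "電気": [0.9, 0.1], "止まる": [0.1, 0.9]}
new_vec = sentence_vector([("停電", "noun"), ("が", "particle"), ("起きる", "verb")], toy)
database = {
    "電気が止まる": sentence_vector([("電気", "noun"), ("が", "particle"), ("止まる", "verb")], toy),
    "雨が降る": [1.0, -1.0],  # unrelated entry with a made-up vector
}
flagged = find_duplicates(new_vec, database, threshold=0.2)  # threshold is illustrative
```

Entries in `flagged` would be shown to the participant, who decides whether the new sentence reuses an existing entry or is stored as a new one.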
Connections
- Provides the NLP backend for Interactive Knowledge Elicitation Method
- The human-in-the-loop confirmation in Interactive Knowledge Elicitation Method corrects false positives from Method B
- Compare to LLM Expert Elicitation for Bayesian Networks where LLMs replace rule-based NLP for causal structure extraction
See Also
- Yamashita 2020 - Overview — paper context
- Interactive Knowledge Elicitation Method — how these methods are used in practice