LLM Expert Elicitation for Bayesian Networks

Summary

The dual-LLM approach uses two LLMs sequentially: LLM-1 (GPT-4o) proposes causal relationships from domain knowledge, and LLM-2 (Claude) verifies those relationships, identifies confounders, and flags inconsistencies. This multi-expert AI elicitation produces Bayesian network structures with lower entropy and fewer logical inconsistencies than BIC-based or human expert methods.

Overview

Motivation: Traditional expert elicitation is slow and subjective. Statistical structure learning (BIC, MIIC) lacks domain knowledge and produces causally inconsistent graphs. LLMs function as proxies for domain experts, combining broad training knowledge with the ability to reason about causal plausibility.

Dual-LLM Architecture

Data → LLM-1 (GPT-4o): Find causal structure → LLM-2 (Claude): Evaluate relationships
                ↓                                          ↓
        Initial causal graph                  Verified/refined structure → BN III

Step 1 — LLM-1 Causal Discovery Prompt (Table 2):

The prompt instructs LLM-1 to act as a domain expert and:

  • Interpret statistically suggested causal relationships from a domain knowledge perspective
  • Assess plausibility in the context of the domain (e.g., sleep health)
  • Provide reasoned explanations for why relationships are natural or unexpected

Step 2 — LLM-2 Verification Prompt (Table 3):

LLM-2 receives LLM-1’s output and is instructed to:

  • Assess the plausibility of each proposed relationship
  • Identify confounding factors or alternative explanations
  • Suggest corrections or additional relationships where appropriate

This two-step design creates a “dual-expert” system: cross-checking reduces hallucination and improves consistency.
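The two steps can be sketched as a simple prompt chain. This is a minimal illustration, not the paper's implementation: the prompt wording only paraphrases the bullet points above (the actual prompts are in Tables 2 and 3), and the names `LLM`, `build_discovery_prompt`, `build_verification_prompt`, and `dual_llm_elicitation` are hypothetical. Each model is represented by a plain callable so that no specific vendor API is assumed.

```python
from typing import Callable

# Hypothetical interface: any callable that maps a prompt string to a reply.
# In practice this would wrap a GPT-4o or Claude client; no real API is used here.
LLM = Callable[[str], str]
Edge = tuple[str, str]


def build_discovery_prompt(domain: str, candidate_edges: list[Edge]) -> str:
    """Step 1: ask LLM-1 to interpret statistically suggested edges."""
    edge_lines = "\n".join(f"- {a} -> {b}" for a, b in candidate_edges)
    return (
        f"Act as a domain expert in {domain}.\n"
        "For each statistically suggested causal relationship below, assess "
        "its plausibility and explain why it is natural or unexpected.\n"
        f"{edge_lines}"
    )


def build_verification_prompt(domain: str, llm1_output: str) -> str:
    """Step 2: ask LLM-2 to review LLM-1's output."""
    return (
        f"Act as a second, independent expert in {domain}.\n"
        "Assess the plausibility of each proposed relationship, identify "
        "confounding factors or alternative explanations, and suggest "
        "corrections or additional relationships where appropriate.\n"
        f"Proposed relationships:\n{llm1_output}"
    )


def dual_llm_elicitation(
    domain: str, candidate_edges: list[Edge], llm1: LLM, llm2: LLM
) -> dict[str, str]:
    """Run the two steps in sequence; LLM-2 receives LLM-1's output."""
    proposal = llm1(build_discovery_prompt(domain, candidate_edges))
    review = llm2(build_verification_prompt(domain, proposal))
    return {"proposal": proposal, "review": review}
```

Passing the models in as callables keeps the chain testable with stub functions and makes it easy to swap either model without touching the prompts.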

Key Implementation Choices

  • State-of-the-art prompt engineering applied to improve output quality
  • Sequential application of multiple LLMs: consistency verified across models; disagreements flagged as potential misinterpretations
  • SEM validation: final relationships verified using structural equation modeling before inclusion in BN
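One way the cross-model consistency check could be expressed, assuming each model's output has already been parsed into a set of directed edges, is a plain set comparison. The function name `flag_disagreements`, its category labels, and the abstract node names are hypothetical and chosen for illustration only; they do not reproduce any result from the study.

```python
Edge = tuple[str, str]  # (parent, child)


def flag_disagreements(proposed: set[Edge], verified: set[Edge]) -> dict[str, set[Edge]]:
    """Compare LLM-1's proposed edges with the edges LLM-2 accepts."""
    # Same pair of variables, but LLM-2 prefers the opposite direction.
    reversed_edges = {
        (a, b) for a, b in proposed
        if (b, a) in verified and (a, b) not in verified
    }
    return {
        "agreed": proposed & verified,
        "reversed": reversed_edges,
        # Proposed by LLM-1 and not accepted in either direction.
        "rejected": proposed - verified - reversed_edges,
        # Suggested only by LLM-2 (excluding mere reversals).
        "added": {(a, b) for a, b in verified - proposed if (b, a) not in proposed},
    }


# Toy input with abstract node names (not data from the study).
proposed = {("A", "B"), ("B", "C"), ("C", "D")}
verified = {("A", "B"), ("C", "B"), ("D", "E")}
report = flag_disagreements(proposed, verified)
```

Edges in the "reversed" and "rejected" groups are the ones a reviewer would inspect as potential misinterpretations.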

Confounders Identified by LLM-2

Variables not in the dataset but identified as likely confounders:

  • Psychological well-being (depression may affect sleep duration, stress, and physical activity)
  • Work schedule flexibility (affects both stress and sleep patterns)
  • Socioeconomic status and income (influences stress and work conditions)

Bidirectional Dependencies Resolved

LLM analysis identified several bidirectional relationships:

  • Sleep Duration ↔ Stress Level
  • Heart Rate ↔ Stress Level

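Because a BN must be a directed acyclic graph (DAG), it cannot contain both directions of an edge, so logical reasoning was used to orient each pair in a single direction. Overall, 10 of the 12 relationships proposed by LLM-1 were confirmed by LLM-2, which supports confidence in the resulting structure.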
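The acyclicity constraint can be checked mechanically. The sketch below uses Python's standard-library `graphlib` (Python 3.9+); the helper name `is_dag` is an assumption for illustration. It shows that keeping both directions of Sleep Duration ↔ Stress Level creates a cycle, while the single orientation Sleep_Duration → Stress_Level (the one listed among the BN III edges) does not. The orientation chosen for Heart Rate ↔ Stress Level is not stated in this note, so it is left out.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(edges: list[tuple[str, str]]) -> bool:
    """Return True if the directed (parent, child) edges contain no cycle."""
    sorter = TopologicalSorter()
    for parent, child in edges:
        sorter.add(child, parent)  # the child depends on its parent
    try:
        sorter.prepare()  # raises CycleError if any cycle exists
    except CycleError:
        return False
    return True


# Both directions kept: not a valid BN structure.
bidirectional = [
    ("Sleep_Duration", "Stress_Level"),
    ("Stress_Level", "Sleep_Duration"),
]
# One direction kept after orientation.
oriented = [("Sleep_Duration", "Stress_Level")]

print(is_dag(bidirectional))  # False
print(is_dag(oriented))       # True

# Share of proposed relationships confirmed by LLM-2 (10 of 12).
confirmation_rate = 10 / 12
print(f"{confirmation_rate:.1%}")  # 83.3%
```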

Resulting Causal Structure (BN III)

Key edges in LLM-derived BN (all SEM-significant except where noted):

Parent → Child                          Estimate   p-value
Daily_Steps → Stress_Level                0.5585   0
Sleep_Duration → Stress_Level            −0.7539   0
Gender → Occupation                      −1.3380   0.0001
Occupation → Stress_Level                −0.0475   0.0008
Stress_Level → Quality_of_Sleep          −0.2180   0.0001
Physical_Activity → Quality_of_Sleep      0.0137   0.5989 (not significant)
Sleep_Duration → Quality_of_Sleep         0.5249   0
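As a small worked example, the edges in the table above can be screened by p-value and grouped into parent sets, which is the form a BN structure takes. The significance threshold `ALPHA = 0.05` is an assumption (the note does not state the level used), and p-values printed as 0 are taken as reported, presumably rounded. The non-significant edge is only flagged here, not dropped, since the table lists it among the BN III edges.

```python
ALPHA = 0.05  # assumed significance level; not stated in the source

# (parent, child, SEM estimate, p-value), as read from the table above.
EDGES = [
    ("Daily_Steps", "Stress_Level", 0.5585, 0.0),
    ("Sleep_Duration", "Stress_Level", -0.7539, 0.0),
    ("Gender", "Occupation", -1.3380, 0.0001),
    ("Occupation", "Stress_Level", -0.0475, 0.0008),
    ("Stress_Level", "Quality_of_Sleep", -0.2180, 0.0001),
    ("Physical_Activity", "Quality_of_Sleep", 0.0137, 0.5989),
    ("Sleep_Duration", "Quality_of_Sleep", 0.5249, 0.0),
]

# Split edges by whether the SEM coefficient is significant at ALPHA.
significant = [(p, c) for p, c, _, pval in EDGES if pval < ALPHA]
not_significant = [(p, c) for p, c, _, pval in EDGES if pval >= ALPHA]

# Parent set of each child node, built from all listed edges.
parents: dict[str, list[str]] = {}
for parent, child, _, _ in EDGES:
    parents.setdefault(child, []).append(parent)

print(len(significant))   # 6
print(not_significant)    # [('Physical_Activity', 'Quality_of_Sleep')]
print(parents["Stress_Level"])  # ['Daily_Steps', 'Sleep_Duration', 'Occupation']
```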

Comparison to Human and BIC Methods

See BN Construction Methods Comparison for the full three-way comparison.

Connections

See Also