Code vs Text Prompt Evaluation: LLM Causal Reasoning Benchmarks

Summary

Across zero-shot and one-shot settings on abductive (αNLG) and counterfactual (TimeTravel) tasks, code prompts outperform text prompts for most models (average zero-shot gain: +5.1% BLEU, +5.3% BERTScore). Code-LLMs (CodeLLaMA, Codex) consistently outperform their same-architecture general-purpose counterparts (LLaMA-2, GPT-3) by ~14% BLEU. The GPT-3/Codex pair suggests an alignment tax: instruction fine-tuning may weaken code-derived causal abilities.

Overview

Section 5 answers RQ1 (are Code-LLMs better causal reasoners than general-purpose LLMs?) and RQ2 (do code prompts work better than text prompts?). Experiments use automatic metrics and a human pairwise evaluation (100 examples, inter-rater reliability 0.63).

Main Content

Experimental Setup

Datasets: αNLG (3,561 instances) for abductive; TimeTravel (1,871 instances) for counterfactual.

Models:

Type                      Models
General-purpose (open)    LLaMA-2 7B, QWEN1.5 7B, DeepSeek-LLM 7B, Mixtral 8×7B
General-purpose (closed)  Gemini-Pro, GPT-3 (text-davinci-002)
Code (open)               CodeLLaMA 7B (= LLaMA-2 + code fine-tune), CodeQWEN1.5 7B
Code (closed)             Codex (code-davinci-002)

Three paired comparisons share architecture, differ only in training corpus: ⟨LLaMA-2, CodeLLaMA⟩, ⟨QWEN1.5, CodeQWEN1.5⟩, ⟨GPT-3, Codex⟩.

Baselines: DELOREAN, COLD, DIFFUSION (abductive); CGMH, EduCAT, DELOREAN, COLD (counterfactual).

Zero-Shot Results

Key Numerical Results (Zero-Shot)

Abductive reasoning (Table 2):

Model      Prompt  BLEU₄        ROUGE_L      CIDEr         BERTScore
LLaMA-2    Text    4.8          28.7         44.0          58.0
LLaMA-2    Code    6.1 (+1.3)   30.5 (+1.8)  50.3 (+6.3)   59.2 (+1.2)
CodeLLaMA  Text    5.6          31.0         49.9          59.8
CodeLLaMA  Code    6.2 (+0.6)   31.7 (+0.7)  55.4 (+5.5)   60.1 (+0.3)
Codex      Text    11.7         37.5         78.5          62.5
Codex      Code    13.7 (+2.0)  39.6 (+2.1)  81.8 (+3.3)   64.9 (+2.4)
Gemini     Code    13.5 (+6.9)  38.1 (+8.1)  80.8 (+28.2)  64.2 (+5.4)

Counterfactual reasoning (Table 3):

Model      Prompt  BLEU₄         ROUGE_L       BERTScore
LLaMA-2    Text    18.7          33.2          63.3
LLaMA-2    Code    33.8 (+15.1)  51.5 (+18.3)  72.7 (+9.4)
CodeLLaMA  Text    57.2          62.8          79.1
CodeLLaMA  Code    59.7 (+2.5)   63.9 (+1.1)   79.7 (+0.6)
Codex      Code    66.8 (+11.7)  70.0 (+8.7)   82.5 (+4.7)

Findings:

  • Code prompts outperform text prompts for all Code-LLMs and most general-purpose LLMs (exceptions: DeepSeek-LLM and QWEN1.5 on one abductive metric, GPT-3 on counterfactual).
  • Average gain: +5.1% BLEU, +5.3% BERTScore in zero-shot.
  • Codex outperforms GPT-3 even though the two share a base architecture; the gap is attributed to GPT-3's instruction tuning.
  • Alignment tax: GPT-3's instruction fine-tuning may weaken the code-derived causal reasoning ability present in Codex.
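The parenthesized gains in the zero-shot tables can be recomputed directly from the paired text and code rows. The sketch below does this for BLEU₄ using only the rows reproduced in these notes; the names `BLEU4` and `code_gain` are illustrative and not taken from the paper. Because this excerpt covers only a subset of the evaluated models, the result is not expected to reproduce the reported +5.1% average.

```python
# Zero-shot BLEU-4 scores copied from the tables above: (text prompt, code prompt).
# Only models with both rows present in these notes are included.
BLEU4 = {
    ("abductive", "LLaMA-2"): (4.8, 6.1),
    ("abductive", "CodeLLaMA"): (5.6, 6.2),
    ("abductive", "Codex"): (11.7, 13.7),
    ("counterfactual", "LLaMA-2"): (18.7, 33.8),
    ("counterfactual", "CodeLLaMA"): (57.2, 59.7),
}


def code_gain(text_score, code_score):
    """Absolute gain of the code prompt over the text prompt, in metric points."""
    # Round to one decimal to match the precision used in the tables.
    return round(code_score - text_score, 1)


gains = {key: code_gain(*scores) for key, scores in BLEU4.items()}

for (task, model), gain in gains.items():
    print(f"{task:15s} {model:10s} {gain:+.1f}")
```

Each printed value should match the corresponding parenthesized BLEU₄ delta in the tables, which is a quick way to check that the flattened rows were read correctly.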

Code-LLM vs General-Purpose LLM (RQ1)

Paired comparisons confirm Code-LLMs are better causal reasoners:

  • CodeLLaMA and Codex outperform LLaMA-2 and GPT-3, respectively, on both tasks and in both prompt formats (~14% BLEU average gain).
  • CodeQWEN1.5 is inferior to QWEN1.5 on abductive reasoning but much better on counterfactual reasoning, where it handles the more complex branching structure better.

Format Perturbations (Table 4)

Adding syntactically valid code elements to prompts improves performance:

  • Adding return statement: improves counterfactual consistently; mixed on abductive.
  • Adding pass statement: similar pattern.
  • Making prompts syntactically valid Python generally helps, especially for counterfactual reasoning (about +5% BERTScore for CodeLLaMA with return).
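To see why a trailing return or pass statement affects syntactic validity, consider a function whose body holds only a comment: Python rejects it, and a single closing statement makes it parse. The sketch below checks this with the standard `ast` module. The prompt skeleton, the function name `premise`, and the helpers `build_prompt` and `is_valid_python` are hypothetical illustrations; these notes do not reproduce the paper's actual prompt template.

```python
import ast


def build_prompt(closing_statement=None):
    """Build a hypothetical code-style prompt skeleton (not the paper's template).

    The function body holds only a comment standing in for an event
    description, optionally followed by a closing statement.
    """
    body = ["    # (event description goes here)"]
    if closing_statement:
        body.append(f"    {closing_statement}")
    return "def premise():\n" + "\n".join(body) + "\n"


def is_valid_python(source):
    """Return True if the source parses as Python, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


for stmt in (None, "pass", "return"):
    print(f"closing statement {stmt!r}: valid = {is_valid_python(build_prompt(stmt))}")
```

The comment-only body fails to parse because a comment is not a statement, so the function has no indented block. Either closing statement supplies one.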

One-Shot Results (Table 5)

All models improve in the one-shot setting vs. zero-shot. Code-LLMs still outperform general-purpose LLMs. Code prompts remain better than text prompts for most models. The advantage is robust across settings.

Human Evaluation (Table 6)

Pairwise comparison on 100 test examples (three PhD annotators, inter-rater reliability 0.63):

Comparison                                                   Code wins  Tie    Text wins
Codex code vs GPT-3 text (abductive, coherence)              40%        38%    22%
Codex code vs GPT-3 text (counterfactual, preservation)      47.5%      39.5%  13%
Mixtral code vs Mixtral text (abductive, coherence)          31.5%      47.5%  21%
Mixtral code vs Mixtral text (counterfactual, preservation)  51%        38.5%  10.5%

Code prompts help models generate outputs more coherent with context and better at preserving original endings under counterfactual conditions.
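One way to read the pairwise table is to compute, for each comparison, the margin between code wins and text wins and the share of code wins among non-tied judgments. The sketch below derives both from the percentages above. These two quantities are illustrative summaries chosen for this note, not metrics reported in the source, and the names `ROWS` and `summarize` are assumptions.

```python
# Human-evaluation percentages copied from the table above:
# (code wins, tie, text wins).
ROWS = {
    "Codex code vs GPT-3 text (abductive, coherence)": (40.0, 38.0, 22.0),
    "Codex code vs GPT-3 text (counterfactual, preservation)": (47.5, 39.5, 13.0),
    "Mixtral code vs Mixtral text (abductive, coherence)": (31.5, 47.5, 21.0),
    "Mixtral code vs Mixtral text (counterfactual, preservation)": (51.0, 38.5, 10.5),
}


def summarize(code_wins, tie, text_wins):
    """Return (margin in points, share of code wins among non-tied judgments)."""
    # Sanity check: each row of the table should account for all judgments.
    assert abs(code_wins + tie + text_wins - 100.0) < 1e-9
    margin = code_wins - text_wins
    decided_share = code_wins / (code_wins + text_wins)
    return margin, decided_share


summary = {name: summarize(*row) for name, row in ROWS.items()}

for name, (margin, share) in summary.items():
    print(f"{name}: margin {margin:+.1f} points, {share:.0%} of non-tied judgments")
```

In all four comparisons the margin is positive, and it is larger on the counterfactual rows than on the abductive rows.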

Connections

See Also