Code Prompt Aspects Analysis: What Makes Code Prompts Effective for Causal Reasoning

Summary

Section 6 of Liu et al. runs a systematic intervention study on four aspects of code prompts — information, structure, format, and language — to isolate what drives performance gains. The finding: programming structure (conditional statements) is the most critical factor, causing a ~10% BLEU / ~6% BERTScore drop when removed. Information and format perturbations have smaller effects. Models are robust to language (Python vs. Java/C) and format (function vs. class vs. print), but extremely sensitive to the conditional control flow structure.

Overview

After establishing that code prompts outperform text prompts (§5), Section 6 asks why. The analysis decomposes code prompts along four dimensions using controlled interventions on LLaMA-2 and Codex.

Main Content

Four Intervention Dimensions

Four Aspects of Code Prompts

| Aspect | Perturbation | What It Tests |
|---|---|---|
| Information | No Instruction: remove task description; Function Name Perturbation: replace premise(), hypothesis() with functionA(), functionB() | Whether task instructions and semantic function names contribute beyond the code structure |
| Structure | Sequential Structure: convert if hypothesis(): ending() to premise(); hypothesis(); ending() (no conditional); Disruption: randomly swap function positions in main() | Whether the conditional control flow itself is the key element |
| Format | Class: wrap functions in a class __init__ method; Print: replace comment bodies with print('...') | Whether the syntactic format (function vs. class vs. print) matters |
| Language | Java: translate Python to Java; C: translate Python to C (auto-translated by Codex) | Whether Python-specific patterns drive the gains vs. general code syntax |
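
The structure and function-name perturbations can be pictured with a small prompt generator. This is a minimal sketch, not the paper's actual prompt template: the function names come from the table, but the placement of premise() before the conditional, the helper names, and the seeded swap are assumptions made for illustration.

```python
import random

# Call names taken from the table above; their order in main() is an assumption.
CALLS = ["premise()", "hypothesis()", "ending()"]


def conditional_main(calls):
    """Baseline layout: the final call is guarded by the call before it."""
    *setup, condition, outcome = calls
    lines = ["def main():"]
    lines += [f"    {call}" for call in setup]
    lines += [f"    if {condition}:", f"        {outcome}"]
    return "\n".join(lines)


def sequential_main(calls):
    """Sequential Structure: same calls, no conditional."""
    return "\n".join(["def main():"] + [f"    {call}" for call in calls])


def disrupted_main(calls, seed=0):
    """Disruption: swap two randomly chosen call positions, keep the if-layout."""
    rng = random.Random(seed)
    i, j = rng.sample(range(len(calls)), 2)
    shuffled = list(calls)
    shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return conditional_main(shuffled)


def perturb_function_names(prompt):
    """Function Name Perturbation: replace semantic names with generic ones."""
    return prompt.replace("premise", "functionA").replace("hypothesis", "functionB")


if __name__ == "__main__":
    variants = (
        conditional_main(CALLS),
        sequential_main(CALLS),
        disrupted_main(CALLS),
        perturb_function_names(conditional_main(CALLS)),
    )
    for variant in variants:
        print(variant, end="\n\n")
```

Each helper returns prompt text rather than executable code, so the variants can be compared side by side: only sequential_main drops the if, while disrupted_main keeps the if but changes which event sits in the condition.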

6.2 Intervention Results (Table 8)

Abductive reasoning (LLaMA-2 baseline: BLEU 6.1 / BERTScore 59.2; Codex baseline: 13.7 / 64.9):

| Aspect | Perturbation | LLaMA-2 BLEU | Codex BLEU | Key change |
|---|---|---|---|---|
| Information | No Instruction | 6.2 | 12.1 | Codex drops modestly; LLaMA-2 essentially unchanged |
| Information | Function Name Perturb | 6.6 | 15.1 | Codex improves; function names may add noise |
| Structure | Sequential | 5.0 | 9.6 | Codex BLEU 13.7 → 9.6 (about 30% relative drop) |
| Structure | Disruption | 4.9 | 7.9 | Codex BLEU 13.7 → 7.9 (about 42% relative drop); largest single drop |
| Format | Class | 5.6 | 16.0 | Codex improves slightly |
| Format | Print | 6.3 | 13.8 | Roughly similar |
| Language | Java | 6.6 | 16.5 | Codex best result; Java conditional syntax is effective |
| Language | C | 5.8 | 15.5 | Similar improvement |

Counterfactual reasoning (LLaMA-2 baseline: BLEU 33.8 / BERTScore 72.7; Codex baseline: 66.8 / 82.5):

| Aspect | Perturbation | LLaMA-2 BLEU | Codex BLEU |
|---|---|---|---|
| Structure | Sequential | 21.8 | 43.4 |
| Structure | Disruption | 3.9 | 16.0 |
| Format | Print | 38.5 | 73.3 |
| Language | Java | 41.9 | 71.1 |
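
To compare perturbations across the two tasks, it can help to convert the Codex BLEU values listed above into relative changes against each baseline. The sketch below does only that arithmetic; the dictionary and function names are illustrative, and the resulting percentages are per-row relative changes, which need not match the averages the paper reports across models and metrics.

```python
# Codex BLEU values copied from the two result tables above.
ABDUCTIVE_CODEX = {
    "baseline": 13.7,
    "Sequential": 9.6,
    "Disruption": 7.9,
    "Java": 16.5,
}
COUNTERFACTUAL_CODEX = {
    "baseline": 66.8,
    "Sequential": 43.4,
    "Disruption": 16.0,
    "Print": 73.3,
    "Java": 71.1,
}


def relative_change(baseline, score):
    """Percentage change of score relative to baseline (negative means a drop)."""
    return 100.0 * (score - baseline) / baseline


def summarize(scores):
    """Map each perturbation to its relative change, rounded to one decimal."""
    base = scores["baseline"]
    return {
        name: round(relative_change(base, value), 1)
        for name, value in scores.items()
        if name != "baseline"
    }


if __name__ == "__main__":
    for task, scores in (("abductive", ABDUCTIVE_CODEX),
                         ("counterfactual", COUNTERFACTUAL_CODEX)):
        print(task, summarize(scores))
```

On these numbers the two structure perturbations are the only rows with negative relative change for Codex, while the format and language rows sit above baseline.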

Key Findings

Finding: Programming Structure is the Critical Factor

Changing the conditional structure to a sequential structure causes an average performance drop of ~10% BLEU and ~6% BERTScore on abductive reasoning. The Disruption intervention (randomly swapping function positions in main()) causes drops of ~17% BLEU on average, demonstrating extreme sensitivity to the conditional ordering.

This establishes that it is not the code format, programming language, or task description that drives code prompt effectiveness — it is the causal control flow encoded in if/elif statements.

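Information: Models do not heavily rely on task instructions. Even when the semantic function names are replaced with generic ones, Codex can reason from the conditional structure alone. LLaMA-2 suffers more from function name perturbation (weaker code understanding → relies on semantic names as clues), though the abductive BLEU column above does not show this (6.6 vs. a 6.1 baseline).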

Format: Models are relatively robust to format changes. Wrapping the functions in a class or replacing comment bodies with print statements leaves abductive BLEU close to or above baseline (Codex 16.0 and 13.8 vs. 13.7; LLaMA-2 5.6 and 6.3 vs. 6.1), suggesting the benefit does not depend on the function-with-comment layout.
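
The Print perturbation can be illustrated with a tiny renderer that emits the same event function in either style. This is a hedged sketch: the intervention table only says that comment bodies are replaced with print('...'), so the exact layout, the render_function helper, and the example sentence are invented placeholders.

```python
def render_function(name, sentence, style="comment"):
    """Render one event function as prompt text in the given body style."""
    if style == "comment":
        body = f"# {sentence}"
    elif style == "print":
        # repr() adds the quotes, giving print('...') for plain sentences.
        body = f"print({sentence!r})"
    else:
        raise ValueError(f"unknown style: {style}")
    return f"def {name}():\n    {body}"


if __name__ == "__main__":
    sentence = "It started to rain."  # invented placeholder text
    print(render_function("premise", sentence, style="comment"))
    print(render_function("premise", sentence, style="print"))
```

Both renderings carry the same sentence and the same function name; only the surface form of the body differs, which is the kind of change the results suggest models tolerate.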

Language: Java/C code prompts achieve similar or better performance than Python for Codex (abductive BLEU 16.5 and 15.5 vs. 13.7). This argues against Python-specific training data as the explanation and points to conditional logic as the common mechanism across languages.

Interpretation

The results paint a consistent picture:

  • LLMs are highly sensitive to structural changes (conditional flow) but robust to surface changes (format, language, function names).
  • This aligns with the hypothesis that code prompts work because they encode causal structure explicitly — the if/elif syntax represents conditional causation in a way that is both syntactically unambiguous and semantically meaningful to code-trained LLMs.
  • General-purpose LLMs (LLaMA-2) benefit less because they are trained on less code and may rely more on semantic function names.

Connections

  • The finding that structure matters more than language connects to the Code Prompts for Causal Structure design — the DAG-to-code mapping works across languages.
  • The sensitivity to disruption (swapping function positions) parallels the sensitivity of causal DAGs to edge additions/deletions shown in Summary Causal DAGs: small structural changes can change all downstream inferences.
  • This informs Fine-tuning on Conditional Statements — training should focus on conditional statement patterns, not just general code.

See Also