LLM Prompt Format Evaluation

Which prompt format produces the most reliable structured outputs? Evaluated across 10 models, 4 formats, and 131 examples, for a total of 31,440 LLM calls.
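The total call count can be reconstructed from the evaluation matrix. This sketch assumes two temperature settings (0.0 and 0.3, which both appear in the results below) and three repeated runs per configuration; the repeat count is an assumption, not stated in the report.

```python
# Reconstructing the 31,440-call total (repeat count is assumed).
models = 10
formats = 4
examples = 131
temperatures = 2  # temp=0.0 and temp=0.3 appear in the results
repeats = 3       # assumed: consistency scoring needs repeated runs

total_calls = models * formats * examples * temperatures * repeats
print(total_calls)  # 31440
```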

Key Findings

Overall Rankings

Top 20 Configurations — Weighted Score (Accuracy 40% + Parseability 30% + Consistency 20% + No-Hallucination 10%)
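The weighted score above is a straight linear combination of the four per-metric scores. A minimal sketch, assuming each metric is normalized to [0, 1] (the function name and signature are illustrative, not from the report):

```python
def weighted_score(accuracy, parseability, consistency, no_hallucination):
    """Combine four per-metric scores (each in [0, 1]) using the
    ranking weights: 40% / 30% / 20% / 10%."""
    return (0.40 * accuracy
            + 0.30 * parseability
            + 0.20 * consistency
            + 0.10 * no_hallucination)

# Example: high accuracy, perfect parsing, some run-to-run variance.
print(round(weighted_score(0.9, 1.0, 0.8, 1.0), 2))  # 0.92
```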

Format Comparison by Model

Average Score by Format (temp=0.0)

Average Score by Model (temp=0.0, best format)

Per-Task Results

Scores by Model + Format

Per-Field Accuracy (Best Config)
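Per-field accuracy can be computed by comparing each parsed output against the gold record field by field. A sketch under assumptions: field names, exact-match comparison, and the helper name are all illustrative, since the report does not specify its matching rules.

```python
def per_field_accuracy(predictions, gold):
    """For each field, the fraction of examples where the parsed value
    exactly matches the gold value. Exact match is an assumption; the
    report may use fuzzier per-field comparisons."""
    fields = gold[0].keys()
    return {
        field: sum(p.get(field) == g[field] for p, g in zip(predictions, gold))
               / len(gold)
        for field in fields
    }

# Hypothetical two-example task with two fields.
preds = [{"name": "Ada", "year": 1843}, {"name": "Ada", "year": 1842}]
golds = [{"name": "Ada", "year": 1843}, {"name": "Ada", "year": 1843}]
print(per_field_accuracy(preds, golds))  # {'name': 1.0, 'year': 0.5}
```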

Temperature Impact

Consistency: temp=0.0 vs temp=0.3
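Consistency across temperatures implies repeated runs per configuration. One plausible definition, sketched below, is agreement with the modal output across repeats; the report's exact metric may differ, and the function name is an assumption.

```python
from collections import Counter

def consistency(outputs):
    """Fraction of repeated runs that agree with the most common output.
    One plausible definition; the report's exact metric may differ."""
    if not outputs:
        return 0.0
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)

# Three repeats of one configuration: two of three runs agree.
runs = ['{"a": 1}', '{"a": 1}', '{"a": 2}']
print(consistency(runs))
```
At temp=0.0 outputs are near-deterministic, so consistency is expected to be higher than at temp=0.3, which is what this chart compares.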

Accuracy: temp=0.0 vs temp=0.3

Latency Comparison

Average Latency by Model (ms, temp=0.0)