LLM Prompt Format Evaluation
Which prompt format produces the best structured outputs? Evaluated across 10 models, 4 prompt formats, and 131 examples: 31,440 LLM calls in total.
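The call total decomposes cleanly: 10 models × 4 formats × 131 examples × 2 temperatures (0.0 and 0.3, per the tables below) × 3 runs per configuration = 31,440. A minimal harness sketch under those assumptions; the model names, format set, and stub functions are placeholders, not the actual ones used here:

```python
import itertools

# Placeholder names: the real model list, format set, and client are not
# given in this README.
MODELS = [f"model-{i}" for i in range(10)]
FORMATS = ["json", "yaml", "xml", "markdown"]  # assumed format set
TEMPERATURES = [0.0, 0.3]
RUNS_PER_CONFIG = 3  # repeated runs make the consistency metric measurable

def render_prompt(example: dict, fmt: str) -> str:
    """Stand-in for the format-specific prompt template."""
    raise NotImplementedError

def call_model(model: str, prompt: str, temperature: float) -> str:
    """Stand-in for the real LLM client."""
    raise NotImplementedError

def run_harness(examples: list[dict]) -> list[dict]:
    """One record per call: 10 x 4 x 2 x 3 x len(examples) rows."""
    results = []
    for model, fmt, temp in itertools.product(MODELS, FORMATS, TEMPERATURES):
        for example in examples:
            prompt = render_prompt(example, fmt)
            for run in range(RUNS_PER_CONFIG):
                results.append({
                    "model": model, "format": fmt, "temp": temp,
                    "example_id": example["id"], "run": run,
                    "raw": call_model(model, prompt, temp),
                })
    return results
```

Recording one row per call keeps every downstream table (rankings, temperature impact, latency) as a plain aggregation over the same results file.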
Overall Rankings
Top 20 — Weighted Score (Accuracy 40% + Parseability 30% + Consistency 20% + No Hallucination 10%)
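The ranking score is the stated weighted sum of four per-configuration metrics, each normalized to [0, 1]; a minimal sketch (the metric descriptions in the comments are paraphrases, not the exact scoring rules):

```python
WEIGHTS = {
    "accuracy": 0.40,         # extracted values match the gold labels
    "parseability": 0.30,     # output parses as the requested format
    "consistency": 0.20,      # repeated runs agree with each other
    "no_hallucination": 0.10, # no fields invented beyond the schema
}

def weighted_score(metrics: dict[str, float]) -> float:
    """Combine per-config metrics (each in [0, 1]) into the ranking score."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

# Example: perfect parseability, imperfect everything else.
print(weighted_score({
    "accuracy": 0.90, "parseability": 1.00,
    "consistency": 0.85, "no_hallucination": 0.95,
}))  # 0.36 + 0.30 + 0.17 + 0.095 = 0.925
```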
Format Comparison by Model
Average Score by Format (temp=0.0)
Average Score by Model (temp=0.0, best format)
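Both averages are simple aggregations over the per-call results; a sketch using pandas, assuming the results are stored as JSON Lines with `model`, `format`, `temp`, and `score` columns:

```python
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)  # assumed results file
t0 = df[df["temp"] == 0.0]

# Average score by format, pooled across models.
by_format = t0.groupby("format")["score"].mean().sort_values(ascending=False)

# Average score by model, taking each model's best format.
by_model = (
    t0.groupby(["model", "format"])["score"].mean()
      .groupby(level="model").max()
      .sort_values(ascending=False)
)
print(by_format, by_model, sep="\n\n")
```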
Per-Task Results
Scores by Model + Format
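The model × format grid is a pivot of the same table; a sketch under the same column assumptions:

```python
import pandas as pd

df = pd.read_json("results.jsonl", lines=True)  # assumed results file
grid = df[df["temp"] == 0.0].pivot_table(
    index="model", columns="format", values="score", aggfunc="mean"
)
print(grid.round(3))  # one row per model, one column per prompt format
```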
Per-Field Accuracy (Best Config)
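Scoring accuracy per field means one wrong field does not zero out an otherwise correct example. A sketch, assuming parsed outputs and gold labels are flat dicts keyed by field name:

```python
from collections import defaultdict

def per_field_accuracy(predictions: list[dict], gold: list[dict]) -> dict[str, float]:
    """Fraction of examples where each field exactly matches the gold value."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for pred, ref in zip(predictions, gold):
        for field, expected in ref.items():
            totals[field] += 1
            if pred.get(field) == expected:
                hits[field] += 1
    return {field: hits[field] / totals[field] for field in totals}
```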
Temperature Impact
Consistency: temp=0.0 vs temp=0.3
Accuracy: temp=0.0 vs temp=0.3
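With three runs per configuration (the assumption from the harness sketch above), consistency can be scored as the share of examples whose runs all parse to identical output; computing it once at temp=0.0 and once at temp=0.3 yields the comparison above. A sketch:

```python
def consistency(parsed_runs: dict[str, list[dict | None]]) -> float:
    """Fraction of examples whose repeated runs all produced identical output.

    parsed_runs maps example_id -> parsed outputs, one per run; unparseable
    runs should be recorded as None so they count against consistency.
    """
    identical = sum(
        1 for runs in parsed_runs.values()
        if runs and runs[0] is not None and all(r == runs[0] for r in runs)
    )
    return identical / len(parsed_runs)
```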
Latency Comparison
Average Latency by Model (ms, temp=0.0)
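Latency here is wall-clock time per call, averaged per model; a minimal timing wrapper, with the LLM client again stubbed out:

```python
import time

def call_model(model: str, prompt: str, temperature: float) -> str:
    """Stand-in for the real LLM client (not specified in this README)."""
    raise NotImplementedError

def timed_call(model: str, prompt: str, temperature: float) -> tuple[str, float]:
    """Return the model response plus elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    response = call_model(model, prompt, temperature)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return response, elapsed_ms
```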