
Beyond Correctness: Benchmarking LLM Code Efficiency
Measuring and improving the time efficiency of AI-generated code
COFFE is the first benchmark specifically designed to evaluate the time efficiency of code generated by Large Language Models (LLMs).
- Evaluates 10 state-of-the-art LLMs across 200 diverse programming problems
- Reveals substantial performance gaps: LLM-generated solutions run up to 54× slower than human-written ones (illustrated by the timing sketch after this list)
- Demonstrates that simple efficiency-focused prompting can speed up generated code by 1.4×-1.8× (a minimal prompting sketch appears at the end of this section)
- Provides insights for developing more efficient code generation systems
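To make the gap measurement concrete: benchmarks of this kind execute a human-written reference and an LLM-generated candidate on identical inputs and compare their runtimes. The sketch below is a minimal illustration of that comparison, not COFFE's actual harness; the two solve functions, the input size, and best-of-N wall-clock timing are assumptions made for this example (a real benchmark may use stricter isolation or CPU-level counters).

```python
import time
from typing import Callable

def measure_runtime(fn: Callable[[list[int]], int], data: list[int], repeats: int = 5) -> float:
    """Return the best-of-N wall-clock time for fn(data), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical solutions to the same problem: sum of all pairwise products.
def solve_human(nums: list[int]) -> int:
    # O(n): uses the identity (sum x)^2 = sum x^2 + 2 * sum_{i<j} x_i * x_j
    s = sum(nums)
    sq = sum(x * x for x in nums)
    return (s * s - sq) // 2

def solve_llm(nums: list[int]) -> int:
    # O(n^2): the naive double loop a model might produce
    total = 0
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            total += nums[i] * nums[j]
    return total

data = list(range(2000))
t_human = measure_runtime(solve_human, data)
t_llm = measure_runtime(solve_llm, data)
print(f"human: {t_human:.4f}s  llm: {t_llm:.4f}s  slowdown: {t_llm / t_human:.1f}x")
```

Best-of-N timing reduces noise from scheduler jitter; a production harness would typically also pin CPU affinity and control frequency scaling before trusting small ratios.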
This research addresses a critical gap in AI code generation evaluation, emphasizing that production-ready code must be both correct and efficient, a key consideration for real-world engineering applications.
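On the prompting result: the takeaway is that simply asking the model for efficient code already helps. Below is a minimal sketch of such efficiency-focused prompting, assuming an OpenAI-style chat client; the `openai` package, the model name, and the instruction wording are illustrative placeholders, not the paper's exact setup.

```python
from openai import OpenAI  # assumption: the `openai` package; any chat LLM client works

EFFICIENCY_HINT = (
    "Write the most time-efficient solution you can. "
    "Prefer optimal algorithmic complexity over micro-optimizations, "
    "and avoid unnecessary passes over the input."
)

def generate_solution(problem: str, efficient: bool = True) -> str:
    """Ask the model for code, optionally prepending an efficiency instruction.

    Model name and prompt wording are hypothetical, not the paper's exact prompt.
    """
    client = OpenAI()
    system = EFFICIENCY_HINT if efficient else "Write a correct solution."
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute any code-capable model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": problem},
        ],
    )
    return resp.choices[0].message.content
```

Timing the outputs of `efficient=True` versus `efficient=False` with a harness like the one above would be one way to probe the reported 1.4×-1.8× effect on your own problems.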