Beyond Correctness: Benchmarking LLM Code Efficiency

Measuring and improving the time efficiency of AI-generated code

COFFE is the first benchmark specifically designed to evaluate the time efficiency of code generated by Large Language Models (LLMs).

  • Evaluates 10 state-of-the-art LLMs across 200 diverse programming problems
  • Reveals significant performance gaps: LLM-generated solutions run up to 54× slower than human-written ones (a measurement sketch follows this list)
  • Demonstrates that simple efficiency-focused prompting can improve code performance by 1.4×-1.8× (an illustrative prompt appears further below)
  • Provides insights for developing more efficient code generation systems
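
To make a slowdown figure like the 54× one concrete, here is a minimal sketch of how relative time efficiency can be measured: run a candidate solution and a reference solution on the same input and take the ratio of their times. The toy problem and function names are hypothetical, and this wall-clock approach is a simplification, not COFFE's actual evaluation harness.

```python
import timeit

# Hypothetical solutions to the same toy problem (sum of squares up to n).
# Neither comes from COFFE; they stand in for a human-written reference
# solution and a correct-but-slower LLM-generated one.
def reference_solution(n: int) -> int:
    # Closed-form formula: O(1).
    return n * (n + 1) * (2 * n + 1) // 6

def llm_solution(n: int) -> int:
    # Naive loop: O(n), still correct.
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

def slowdown(candidate, reference, arg, repeats=5, number=200):
    """How many times slower `candidate` is than `reference` on `arg`.

    Takes the best-of-N wall-clock time from timeit to reduce noise;
    a real benchmark would use sturdier measurements than this.
    """
    t_cand = min(timeit.repeat(lambda: candidate(arg), repeat=repeats, number=number))
    t_ref = min(timeit.repeat(lambda: reference(arg), repeat=repeats, number=number))
    return t_cand / t_ref

if __name__ == "__main__":
    n = 10_000
    assert llm_solution(n) == reference_solution(n)  # correctness comes first
    print(f"slowdown: {slowdown(llm_solution, reference_solution, n):.1f}x")
```

Note that both solutions pass the correctness check; only the timing separates them, which is exactly the gap a correctness-only benchmark cannot see.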

This research addresses a critical gap in the evaluation of AI code generation: production-ready code must be not only correct but also efficient, a key consideration for real-world engineering applications.
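
The prompting result above (1.4×-1.8×) can be illustrated with a small sketch. The wording below is hypothetical, not the exact prompt from the paper; it only shows the kind of efficiency instruction that might be appended to a standard code generation prompt.

```python
# Illustrative only: one way to phrase an efficiency-focused prompt.
# The exact wording used in the COFFE paper may differ.
BASE_PROMPT = "Write a Python function that solves the following problem:\n{problem}"

EFFICIENCY_HINT = (
    "\nOptimize for time efficiency: prefer the lowest asymptotic "
    "complexity and avoid unnecessary work in inner loops."
)

def build_prompt(problem: str, efficiency_focused: bool = True) -> str:
    """Assemble a code-generation prompt, optionally appending the
    efficiency instruction referred to in the summary above."""
    prompt = BASE_PROMPT.format(problem=problem)
    return prompt + EFFICIENCY_HINT if efficiency_focused else prompt
```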

COFFE: A Code Efficiency Benchmark for Code Generation
