Beyond Correctness: Benchmarking LLM Code Efficiency

Measuring and improving the time efficiency of AI-generated code

COFFE is the first benchmark specifically designed to evaluate the time efficiency of code generated by Large Language Models (LLMs).

  • Evaluates 10 state-of-the-art LLMs across 200 diverse programming problems
  • Reveals significant performance gaps: LLM-generated solutions run up to 54× slower than human-written ones (a measurement sketch follows this list)
  • Demonstrates that simple efficiency-focused prompting can improve code performance by 1.4×-1.8× (an illustrative prompt appears further below)
  • Provides insights for developing more efficient code generation systems
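
To make a slowdown figure like the 54× one concrete, here is a minimal sketch of how relative time efficiency can be measured: run a candidate solution and a reference solution on the same input and take the ratio of their times. The toy problem and function names are hypothetical, and this wall-clock approach is a simplification, not COFFE's actual evaluation harness.

```python
import timeit

# Hypothetical solutions to the same toy problem (sum of squares up to n).
# Neither comes from COFFE; they stand in for a human-written reference
# solution and a correct-but-slower LLM-generated one.
def reference_solution(n: int) -> int:
    # Closed-form formula: O(1).
    return n * (n + 1) * (2 * n + 1) // 6

def llm_solution(n: int) -> int:
    # Naive loop: O(n), still correct.
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

def slowdown(candidate, reference, arg, repeats=5, number=200):
    """How many times slower `candidate` is than `reference` on `arg`.

    Takes the best-of-N wall-clock time from timeit to reduce noise;
    a real benchmark would use sturdier measurements than this.
    """
    t_cand = min(timeit.repeat(lambda: candidate(arg), repeat=repeats, number=number))
    t_ref = min(timeit.repeat(lambda: reference(arg), repeat=repeats, number=number))
    return t_cand / t_ref

if __name__ == "__main__":
    n = 10_000
    assert llm_solution(n) == reference_solution(n)  # correctness comes first
    print(f"slowdown: {slowdown(llm_solution, reference_solution, n):.1f}x")
```

Note that both solutions pass the correctness check; only the timing separates them, which is exactly the gap a correctness-only benchmark cannot see.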

This research addresses a critical gap in the evaluation of AI code generation: production-ready code must be not only correct but also efficient, a key consideration for real-world engineering applications.
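
The prompting result above (1.4×-1.8×) can be illustrated with a small sketch. The wording below is hypothetical, not the exact prompt from the paper; it only shows the kind of efficiency instruction that might be appended to a standard code generation prompt.

```python
# Illustrative only: one way to phrase an efficiency-focused prompt.
# The exact wording used in the COFFE paper may differ.
BASE_PROMPT = "Write a Python function that solves the following problem:\n{problem}"

EFFICIENCY_HINT = (
    "\nOptimize for time efficiency: prefer the lowest asymptotic "
    "complexity and avoid unnecessary work in inner loops."
)

def build_prompt(problem: str, efficiency_focused: bool = True) -> str:
    """Assemble a code-generation prompt, optionally appending the
    efficiency instruction referred to in the summary above."""
    prompt = BASE_PROMPT.format(problem=problem)
    return prompt + EFFICIENCY_HINT if efficiency_focused else prompt
```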

COFFE: A Code Efficiency Benchmark for Code Generation
