
Testing LLM Reasoning in Code Synthesis
Evaluating how AI agents learn from examples using a novel interactive benchmark
CodeARC introduces a first-of-its-kind benchmark for evaluating how LLMs perform inductive program synthesis: constructing functions from input-output examples that must generalize to unseen data.
- Enables interactive evaluation where models receive feedback on incorrect solutions (see the sketch after this list)
- Tests generalization capabilities by requiring models to extract patterns from examples
- Provides a more realistic assessment that mirrors real-world programming scenarios
- Reveals strengths and limitations of current LLM agents in reasoning-intensive coding tasks
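To make the evaluation protocol concrete, here is a minimal Python sketch of an interactive synthesis loop: a candidate function proposed from visible examples is checked against hidden examples, and counterexamples are fed back for another attempt. This is a simplified illustration under stated assumptions, not the CodeARC harness; helper names such as `synthesize_candidate` and `interactive_eval`, and the fixed placeholder candidate, are hypothetical.

```python
from typing import Any, List, Tuple


def synthesize_candidate(examples: List[Tuple[Any, Any]],
                         feedback: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: given visible I/O examples and
    feedback from earlier rounds, return source code for a candidate function.
    A real harness would prompt a model here; this placeholder returns a
    fixed guess so the loop below is runnable."""
    return "def f(x):\n    return x * 2\n"


def run_candidate(source: str, x: Any) -> Any:
    """Execute the candidate source and apply its function `f` to one input."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["f"](x)


def interactive_eval(visible: List[Tuple[Any, Any]],
                     hidden: List[Tuple[Any, Any]],
                     max_rounds: int = 3) -> bool:
    """Judge a synthesized function on hidden examples; on failure, return
    counterexamples as feedback and allow another round."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        source = synthesize_candidate(visible, feedback)
        mismatches = []
        for x, expected in hidden:
            got = run_candidate(source, x)
            if got != expected:
                mismatches.append((x, expected, got))
        if not mismatches:
            return True  # the candidate generalized to unseen inputs
        # Feed the failing cases back to the model for the next attempt.
        feedback = [f"f({x!r}) returned {got!r}, expected {y!r}"
                    for x, y, got in mismatches]
    return False


if __name__ == "__main__":
    visible = [(1, 2), (3, 6)]   # examples shown to the model
    hidden = [(5, 10), (0, 0)]   # held-out examples used for judging
    print("solved:", interactive_eval(visible, hidden))
```

The key design point this sketch illustrates is that the judge never reveals the target function, only counterexamples, so the model must infer the underlying pattern rather than fit the visible examples.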
This research advances software engineering by providing better methods for assessing AI coding assistants, helping developers understand when and how to use these tools effectively in production environments.
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis