
Testing LLM Reasoning in Code Synthesis
Evaluating how AI agents learn from examples using a novel interactive benchmark
CodeARC introduces a first-of-its-kind benchmark for evaluating how LLMs perform inductive program synthesis: constructing functions from input-output examples that must generalize to unseen data.
- Enables interactive evaluation where models receive feedback on incorrect solutions (see the sketch after this list)
- Tests generalization capabilities by requiring models to extract patterns from examples
- Provides a more realistic assessment that mirrors real-world programming scenarios
- Reveals strengths and limitations of current LLM agents in reasoning-intensive coding tasks
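To make the evaluation protocol concrete, here is a minimal Python sketch of an interactive synthesis loop: a candidate function proposed from visible examples is checked against hidden examples, and counterexamples are fed back for another attempt. This is a simplified illustration under stated assumptions, not the CodeARC harness; helper names such as `synthesize_candidate` and `interactive_eval`, and the fixed placeholder candidate, are hypothetical.

```python
from typing import Any, List, Tuple


def synthesize_candidate(examples: List[Tuple[Any, Any]],
                         feedback: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: given visible I/O examples and
    feedback from earlier rounds, return source code for a candidate function.
    A real harness would prompt a model here; this placeholder returns a
    fixed guess so the loop below is runnable."""
    return "def f(x):\n    return x * 2\n"


def run_candidate(source: str, x: Any) -> Any:
    """Execute the candidate source and apply its function `f` to one input."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["f"](x)


def interactive_eval(visible: List[Tuple[Any, Any]],
                     hidden: List[Tuple[Any, Any]],
                     max_rounds: int = 3) -> bool:
    """Judge a synthesized function on hidden examples; on failure, return
    counterexamples as feedback and allow another round."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        source = synthesize_candidate(visible, feedback)
        mismatches = []
        for x, expected in hidden:
            got = run_candidate(source, x)
            if got != expected:
                mismatches.append((x, expected, got))
        if not mismatches:
            return True  # the candidate generalized to unseen inputs
        # Feed the failing cases back to the model for the next attempt.
        feedback = [f"f({x!r}) returned {got!r}, expected {y!r}"
                    for x, y, got in mismatches]
    return False


if __name__ == "__main__":
    visible = [(1, 2), (3, 6)]   # examples shown to the model
    hidden = [(5, 10), (0, 0)]   # held-out examples used for judging
    print("solved:", interactive_eval(visible, hidden))
```

The key design point this sketch illustrates is that the judge never reveals the target function, only counterexamples, so the model must infer the underlying pattern rather than fit the visible examples.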
This research advances software engineering by providing better methods for assessing AI coding assistants, helping developers understand when and how to use these tools effectively in production environments.
CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis