Benchmarking Coding Assistants Across Languages

A new multi-language benchmark for evaluating AI coding agents in real-world scenarios

SWE-PolyBench introduces a comprehensive evaluation framework for LLM-based coding assistants across multiple programming languages and real-world software repositories.

  • Contains 2,110 task instances drawn from 21 repositories spanning Java, JavaScript, TypeScript, and Python
  • Evaluates agents at the repository level with execution-based tests, rather than on isolated code snippets (see the sketch after this list)
  • Provides a more realistic assessment of how coding agents perform in production environments
  • Helps engineering teams select and optimize the right AI coding tools for their specific tech stack
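
In practice, repository-level, execution-based evaluation means checking out each task's repository at a pinned commit, applying the agent's proposed patch, and running the project's test suite to decide pass or fail. The sketch below illustrates that loop; the instance fields (`repo_url`, `base_commit`, `model_patch`, `test_command`) and the helper function are assumptions for illustration, not SWE-PolyBench's actual harness or schema.

```python
import subprocess
import tempfile

def evaluate_instance(instance: dict) -> bool:
    """Hypothetical harness step: apply an agent's patch and run the tests.

    Field names are illustrative assumptions, not the benchmark's real schema.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Check out the repository at the exact commit the task was built from.
        subprocess.run(["git", "clone", instance["repo_url"], workdir], check=True)
        subprocess.run(["git", "checkout", instance["base_commit"]], cwd=workdir, check=True)

        # Apply the coding agent's proposed fix as a unified diff read from stdin.
        subprocess.run(["git", "apply", "-"], cwd=workdir,
                       input=instance["model_patch"].encode(), check=True)

        # Execution-based scoring: the patch counts only if the tests actually pass.
        result = subprocess.run(instance["test_command"], shell=True, cwd=workdir)
        return result.returncode == 0
```

Running each candidate patch against the real test suite, instead of matching it against a reference solution, is what makes the benchmark's scores reflect whether the agent's change actually works in the repository.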

This research matters for engineering teams because it provides objective metrics to evaluate and compare AI coding assistants before deploying them in professional software development workflows.

SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents
