
Expanding LLM Evaluation Beyond Python
A new multilingual benchmark for code issue resolution
Multi-SWE-bench is a benchmark for evaluating how Large Language Models resolve real-world software issues across seven programming languages beyond Python.
- Covers Java, TypeScript, JavaScript, Go, Rust, C, and C++ with 1,632 high-quality instances
- Enables fair assessment of LLM capabilities in diverse software ecosystems
- Addresses a critical gap left by existing benchmarks, which focus primarily on Python
- Provides a foundation for more rigorous engineering evaluation of AI coding assistants
This research matters for engineering teams because it establishes a standard for evaluating LLM performance across the multilingual software development landscape that real-world application development depends on.
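To illustrate the kind of per-language reporting such a standard implies, here is a minimal sketch that aggregates resolution rates from an agent's run log. The file name `agent_results.jsonl` and the `language`/`resolved` fields are illustrative assumptions, not the benchmark's actual output schema.

```python
import json
from collections import defaultdict


def per_language_resolve_rate(results_path: str) -> dict[str, float]:
    """Compute the fraction of resolved issues per language.

    Assumes a JSON-lines results file where each record looks like
    {"instance_id": "...", "language": "rust", "resolved": true};
    these field names are illustrative, not Multi-SWE-bench's schema.
    """
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lang = record["language"]
            totals[lang] += 1
            resolved[lang] += int(record["resolved"])
    return {lang: resolved[lang] / totals[lang] for lang in totals}


if __name__ == "__main__":
    rates = per_language_resolve_rate("agent_results.jsonl")
    for lang, rate in sorted(rates.items()):
        print(f"{lang:>12}: {rate:.1%} resolved")
```

Reporting results broken down by language in this way, rather than as a single aggregate score, is what lets teams compare an LLM's strengths across ecosystems.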
Paper: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving