
Expanding LLM Evaluation Beyond Python
A new multilingual benchmark for code issue resolution
Multi-SWE-bench is a benchmark for evaluating how Large Language Models resolve real-world software issues across seven programming languages beyond Python.
- Covers Java, TypeScript, JavaScript, Go, Rust, C, and C++ with 1,632 high-quality instances
- Enables fair assessment of LLM capabilities in diverse software ecosystems
- Addresses a critical gap left by existing benchmarks, which focus primarily on Python
- Provides a foundation for more rigorous engineering evaluation of AI coding assistants
This research matters for engineering teams because it establishes a standard for evaluating LLM performance across the multilingual software development landscape that real-world application development depends on.
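To illustrate the kind of per-language reporting such a standard implies, here is a minimal sketch that aggregates resolution rates from an agent's run log. The file name `agent_results.jsonl` and the `language`/`resolved` fields are illustrative assumptions, not the benchmark's actual output schema.

```python
import json
from collections import defaultdict


def per_language_resolve_rate(results_path: str) -> dict[str, float]:
    """Compute the fraction of resolved issues per language.

    Assumes a JSON-lines results file where each record looks like
    {"instance_id": "...", "language": "rust", "resolved": true};
    these field names are illustrative, not Multi-SWE-bench's schema.
    """
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            lang = record["language"]
            totals[lang] += 1
            resolved[lang] += int(record["resolved"])
    return {lang: resolved[lang] / totals[lang] for lang in totals}


if __name__ == "__main__":
    rates = per_language_resolve_rate("agent_results.jsonl")
    for lang, rate in sorted(rates.items()):
        print(f"{lang:>12}: {rate:.1%} resolved")
```

Reporting results broken down by language in this way, rather than as a single aggregate score, is what lets teams compare an LLM's strengths across ecosystems.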
Paper: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving