
LLMs as Code Quality Judges
Can AI replace human evaluators in software engineering?
This research investigates whether large language models can effectively evaluate code and text in software engineering tasks, potentially replacing human reviewers.
- Strong correlation between LLM and human evaluations across multiple SE tasks
- LLM judges perform competitively with traditional metrics such as BLEU and Pass@k
- Cost-effective alternative to human evaluation, cutting evaluation time by 95%
- Clear evaluation protocols and few-shot prompting improve the reliability of LLM judgments (see the sketch after this list)
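To make the few-shot point concrete, a minimal judging setup is a rubric plus a handful of graded examples prepended to the candidate code. The sketch below is an assumption-laden illustration, not the study's protocol: `call_llm` is a hypothetical stand-in for whichever model client you use, and the rubric, scale, and examples are invented.

```python
import json

# Minimal sketch of a few-shot LLM-as-judge prompt for code quality.
# `call_llm` is a hypothetical model client injected by the caller; the rubric,
# 1-5 scale, and graded examples are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    {
        "code": "def add(a, b): return a + b",
        "score": 4,
        "rationale": "Correct and concise, but lacks type hints and a docstring.",
    },
    {
        "code": "def f(x): return eval(x)",
        "score": 1,
        "rationale": "Calls eval on untrusted input; unclear naming; no documentation.",
    },
]

RUBRIC = (
    "Score the code from 1 (poor) to 5 (excellent) for correctness, readability, "
    'and safety. Reply as JSON: {"score": <int>, "rationale": <string>}.'
)


def build_judge_prompt(candidate_code: str) -> str:
    """Assemble the prompt: rubric first, then graded examples, then the candidate."""
    parts = [RUBRIC, ""]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Code:\n{ex['code']}")
        parts.append("Judgment: " + json.dumps({"score": ex["score"], "rationale": ex["rationale"]}))
        parts.append("")
    parts.append(f"Code:\n{candidate_code}")
    parts.append("Judgment:")
    return "\n".join(parts)


def judge(candidate_code: str, call_llm) -> dict:
    """Ask the model for a verdict and parse the JSON reply."""
    reply = call_llm(build_judge_prompt(candidate_code))
    return json.loads(reply)
```

Keeping the verdict in a fixed JSON shape makes the judge's output easy to parse and compare against human ratings or other metrics.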
For engineering teams, this research offers a practical way to automate code quality assessment, reducing manual review time while maintaining evaluation accuracy. The findings suggest LLMs can serve as reliable first-pass evaluators in software development workflows.
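As one way to picture the "first-pass evaluator" role, the sketch below routes only low-scoring changes to a human reviewer. The `triage_changes` helper and the threshold are hypothetical, and it reuses the `judge` and `call_llm` pieces from the sketch above.

```python
# Minimal sketch of a first-pass gate: changes scoring below a cut-off are
# flagged for human review; everything else passes through automatically.
# `judge` and `call_llm` come from the earlier sketch; the threshold is an assumption.

REVIEW_THRESHOLD = 3  # hypothetical cut-off: scores below this need a human look


def triage_changes(changed_files: dict[str, str], call_llm) -> list[str]:
    """Return the paths whose LLM quality score falls below the threshold."""
    needs_human_review = []
    for path, code in changed_files.items():
        verdict = judge(code, call_llm)  # few-shot judge defined in the earlier sketch
        if verdict["score"] < REVIEW_THRESHOLD:
            needs_human_review.append(path)
    return needs_human_review
```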