
LLMs as Code Quality Judges
Can AI replace human evaluators in software engineering?
This research investigates whether large language models can effectively evaluate code and text in software engineering tasks, potentially replacing human reviewers.
- Strong correlation between LLM and human evaluations across multiple SE tasks
- LLM judges perform competitively with traditional metrics such as BLEU and Pass@k
- Cost-effective alternative to human evaluation, cutting evaluation time by 95%
- Clear evaluation protocols and few-shot prompting improve the reliability of LLM judgments (see the sketch after this list)
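To make the few-shot point concrete, a minimal judging setup is a rubric plus a handful of graded examples prepended to the candidate code. The sketch below is an assumption-laden illustration, not the study's protocol: `call_llm` is a hypothetical stand-in for whichever model client you use, and the rubric, scale, and examples are invented.

```python
import json

# Minimal sketch of a few-shot LLM-as-judge prompt for code quality.
# `call_llm` is a hypothetical model client injected by the caller; the rubric,
# 1-5 scale, and graded examples are illustrative assumptions.

FEW_SHOT_EXAMPLES = [
    {
        "code": "def add(a, b): return a + b",
        "score": 4,
        "rationale": "Correct and concise, but lacks type hints and a docstring.",
    },
    {
        "code": "def f(x): return eval(x)",
        "score": 1,
        "rationale": "Calls eval on untrusted input; unclear naming; no documentation.",
    },
]

RUBRIC = (
    "Score the code from 1 (poor) to 5 (excellent) for correctness, readability, "
    'and safety. Reply as JSON: {"score": <int>, "rationale": <string>}.'
)


def build_judge_prompt(candidate_code: str) -> str:
    """Assemble the prompt: rubric first, then graded examples, then the candidate."""
    parts = [RUBRIC, ""]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Code:\n{ex['code']}")
        parts.append("Judgment: " + json.dumps({"score": ex["score"], "rationale": ex["rationale"]}))
        parts.append("")
    parts.append(f"Code:\n{candidate_code}")
    parts.append("Judgment:")
    return "\n".join(parts)


def judge(candidate_code: str, call_llm) -> dict:
    """Ask the model for a verdict and parse the JSON reply."""
    reply = call_llm(build_judge_prompt(candidate_code))
    return json.loads(reply)
```

Keeping the verdict in a fixed JSON shape makes the judge's output easy to parse and compare against human ratings or other metrics.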
For engineering teams, this research offers a practical way to automate code quality assessment, reducing manual review time while maintaining evaluation accuracy. The findings suggest LLMs can serve as reliable first-pass evaluators in software development workflows.
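As one way to picture the "first-pass evaluator" role, the sketch below routes only low-scoring changes to a human reviewer. The `triage_changes` helper and the threshold are hypothetical, and it reuses the `judge` and `call_llm` pieces from the sketch above.

```python
# Minimal sketch of a first-pass gate: changes scoring below a cut-off are
# flagged for human review; everything else passes through automatically.
# `judge` and `call_llm` come from the earlier sketch; the threshold is an assumption.

REVIEW_THRESHOLD = 3  # hypothetical cut-off: scores below this need a human look


def triage_changes(changed_files: dict[str, str], call_llm) -> list[str]:
    """Return the paths whose LLM quality score falls below the threshold."""
    needs_human_review = []
    for path, code in changed_files.items():
        verdict = judge(code, call_llm)  # few-shot judge defined in the earlier sketch
        if verdict["score"] < REVIEW_THRESHOLD:
            needs_human_review.append(path)
    return needs_human_review
```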