Consistency in AI Code Reviews

Evaluating how deterministic LLMs are when reviewing code

This research measures how consistent leading Large Language Models (LLMs) are when performing software code reviews, even with the sampling temperature set to zero.

  • GPT-4o mini proved most deterministic, with identical outputs in 90% of cases
  • Claude 3.5 Sonnet showed significant variability despite zero temperature
  • Consistency varied by review aspect: higher for syntax issues, lower for logic
  • Models exhibited different patterns when handling complex engineering decisions
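The exact-match figures above can be computed by sending the same review prompt repeatedly and counting how often the outputs are byte-identical. A minimal sketch of that metric, using hypothetical canned outputs in place of real temperature-0 LLM calls:

```python
from collections import Counter

def consistency_rate(outputs):
    """Fraction of runs that exactly match the most common output."""
    if not outputs:
        return 0.0
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)

# Hypothetical stand-in for three temperature-0 review runs on one diff.
sample_outputs = [
    "Line 12: possible off-by-one in loop bound.",
    "Line 12: possible off-by-one in loop bound.",
    "Line 12: loop bound may be off by one.",
]
print(f"{consistency_rate(sample_outputs):.2f}")  # 2 of 3 runs identical
```

Exact string matching is the strictest possible criterion; semantically equivalent rewordings, like the third output above, count as inconsistent under it.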

For engineering teams adopting AI-assisted code reviews, this research highlights the importance of understanding model reliability and potentially running multiple reviews for critical code.
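One way to operationalize "running multiple reviews" is a simple majority vote: run the review several times and keep only the findings flagged in most runs. A sketch under that assumption, with hypothetical finding strings standing in for parsed review output:

```python
from collections import Counter

def majority_findings(runs, threshold=0.5):
    """Keep findings flagged in more than `threshold` of the review runs."""
    counts = Counter(f for run in runs for f in set(run))
    needed = threshold * len(runs)
    return sorted(f for f, c in counts.items() if c > needed)

# Hypothetical findings from three independent review runs of one patch.
runs = [
    {"unchecked return value", "magic number 42"},
    {"unchecked return value"},
    {"unchecked return value", "missing docstring"},
]
print(majority_findings(runs))  # only the consistently flagged issue survives
```

Voting filters out one-off findings at the cost of extra API calls, which is why the text suggests reserving it for critical code.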

Measuring Determinism in Large Language Models for Software Code Review
