Trust at Scale: Evaluating LLM Reliability

A framework for assessing how much we can trust AI judgments

This research introduces a rigorous framework for evaluating the reliability of LLM judgments using statistical methods derived from psychometrics.

  • Demonstrates that LLMs can produce inconsistent judgments even in nominally deterministic settings (e.g., temperature 0)
  • Introduces McDonald's omega, a psychometric internal-consistency coefficient, as a reliability metric for AI systems (see the sketch after this list)
  • Shows that aggregating multiple samples from an LLM substantially improves judgment reliability
  • Highlights critical security implications for systems relying on single LLM outputs
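
Concretely, omega treats each repeated LLM call as a parallel "item": fit a one-factor model to the item-by-sample score matrix and compute omega = (sum of loadings)² / [(sum of loadings)² + sum of error variances]. The snippet below is a minimal illustrative sketch, not the authors' code; the matrix shape, the simulated scores, and the function name are assumptions.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def mcdonalds_omega(scores: np.ndarray) -> float:
    """Estimate omega (total) from an (n_items x n_samples) score matrix
    by fitting a single-factor model to the repeated judgments."""
    fa = FactorAnalysis(n_components=1).fit(scores)
    loadings = np.abs(fa.components_[0])   # factor loading of each repeated sample
    uniquenesses = fa.noise_variance_      # residual (error) variance of each sample
    common = loadings.sum() ** 2           # variance attributed to the shared factor
    return common / (common + uniquenesses.sum())

# Example: 200 judged items, 5 independent judgments each (simulated here).
rng = np.random.default_rng(0)
true_quality = rng.normal(size=(200, 1))
scores = true_quality + 0.7 * rng.normal(size=(200, 5))  # noisy repeated judgments
print(f"omega ~ {mcdonalds_omega(scores):.2f}")
```

A value near 1 means the repeated judgments mostly reflect a shared signal; a low value means single-shot judgments are dominated by noise.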

For security professionals, this research provides a concrete way to quantify and mitigate risk when deploying LLMs in high-stakes decision-making scenarios. The framework enables more rigorous reliability testing before deployment, as sketched below.
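
One way to operationalize this before deployment is to sample each judgment several times on a pilot set and gate rollout on the estimated omega. This is a hedged sketch under assumptions, not the paper's procedure: `judge_fn`, `items`, and the 0.8 threshold are illustrative, and `mcdonalds_omega()` is reused from the previous snippet.

```python
# Hypothetical pre-deployment reliability gate. `judge_fn` wraps your LLM
# judge and returns a numeric score for one item; 0.8 is a common
# psychometric rule of thumb, not a value taken from the paper.
import numpy as np

def reliability_gate(judge_fn, items, n_samples=5, threshold=0.8):
    """Collect n_samples judgments per item and require omega >= threshold."""
    scores = np.array([[judge_fn(item) for _ in range(n_samples)]
                       for item in items])
    omega = mcdonalds_omega(scores)
    return omega >= threshold, omega
```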

Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
