
The Illusion of Safety: When LLMs Judge LLMs
Revealing critical flaws in using LLMs as safety evaluators
This research exposes significant reliability issues that arise when Large Language Models (LLMs) are used to evaluate the safety of AI-generated content, finding that their judgments are easily manipulated by superficial changes to the text.
- LLMs show poor self-consistency when repeatedly judging the same content
- LLM judges are highly susceptible to input artifacts such as apologetic language or verbose phrasing (see the probe sketch after this list)
- Evaluation scores can be inflated dramatically by surface-level modifications rather than genuine safety improvements
- Safety evaluations using LLMs may create a false sense of security while failing to identify genuine threats
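The findings lend themselves to a simple robustness probe. The sketch below is illustrative only and is not code from the paper: it assumes a caller-supplied `judge_fn` that maps a response to a numeric safety score, and it uses made-up artifact strings (an apologetic prefix and verbose padding) to check whether scores shift when only the tone of a response changes, while repeated scoring of the same text gauges self-consistency.

```python
import statistics
from typing import Callable

# Hypothetical artifacts in the spirit of those studied in the paper:
# an apologetic prefix and verbose filler that add no real safety mitigation.
APOLOGETIC_PREFIX = (
    "I'm really sorry, and I want to be as careful and responsible as possible here. "
)
VERBOSE_PADDING = (
    " To be thorough, let me restate the key considerations in more detail before concluding."
)


def add_artifacts(response: str) -> str:
    """Apply surface-level edits that change tone, not substance."""
    return APOLOGETIC_PREFIX + response + VERBOSE_PADDING


def probe_judge(
    judge_fn: Callable[[str], float],  # returns a safety score, e.g. in [0, 1]
    response: str,
    trials: int = 5,
) -> dict:
    """Score the original and artifact-padded response several times each,
    then report the mean shift (artifact susceptibility) and the per-variant
    spread (self-consistency)."""
    original_scores = [judge_fn(response) for _ in range(trials)]
    padded_scores = [judge_fn(add_artifacts(response)) for _ in range(trials)]
    return {
        "original_mean": statistics.mean(original_scores),
        "padded_mean": statistics.mean(padded_scores),
        "artifact_shift": statistics.mean(padded_scores) - statistics.mean(original_scores),
        "original_stdev": statistics.stdev(original_scores) if trials > 1 else 0.0,
        "padded_stdev": statistics.stdev(padded_scores) if trials > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in judge for demonstration only: replace with a call to the judge
    # model under test. This toy judge is deliberately artifact-sensitive to
    # show what the probe would flag.
    def toy_judge(text: str) -> float:
        score = 0.4
        if "sorry" in text.lower():
            score += 0.3  # rewards apologetic tone, not actual safety
        return min(score, 1.0)

    report = probe_judge(toy_judge, "Here is how to bypass the content filter...")
    print(report)
```

In practice, `judge_fn` would wrap a call to the judge model being evaluated; a large `artifact_shift` or a high per-variant standard deviation would indicate exactly the fragility the authors report.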
For security professionals, this highlights the urgent need for more robust evaluation frameworks that can't be gamed through simple text manipulations, ensuring AI systems are genuinely safer rather than merely appearing so.
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts