
The Illusion of Safety: When LLMs Judge LLMs
Revealing critical flaws in using LLMs as safety evaluators
This research exposes significant reliability issues that arise when Large Language Models (LLMs) are used to evaluate the safety of AI-generated content, finding that their judgments are easily manipulated by superficial changes to the text.
- LLMs show poor self-consistency when repeatedly judging the same content
- LLM judges are highly susceptible to input artifacts such as apologetic language or verbose phrasing (see the probe sketch after this list)
- Evaluation scores can be inflated dramatically by surface-level modifications rather than genuine safety improvements
- Safety evaluations using LLMs may create a false sense of security while failing to identify genuine threats
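The findings lend themselves to a simple robustness probe. The sketch below is illustrative only and is not code from the paper: it assumes a caller-supplied `judge_fn` that maps a response to a numeric safety score, and it uses made-up artifact strings (an apologetic prefix and verbose padding) to check whether scores shift when only the tone of a response changes, while repeated scoring of the same text gauges self-consistency.

```python
import statistics
from typing import Callable

# Hypothetical artifacts in the spirit of those studied in the paper:
# an apologetic prefix and verbose filler that add no real safety mitigation.
APOLOGETIC_PREFIX = (
    "I'm really sorry, and I want to be as careful and responsible as possible here. "
)
VERBOSE_PADDING = (
    " To be thorough, let me restate the key considerations in more detail before concluding."
)


def add_artifacts(response: str) -> str:
    """Apply surface-level edits that change tone, not substance."""
    return APOLOGETIC_PREFIX + response + VERBOSE_PADDING


def probe_judge(
    judge_fn: Callable[[str], float],  # returns a safety score, e.g. in [0, 1]
    response: str,
    trials: int = 5,
) -> dict:
    """Score the original and artifact-padded response several times each,
    then report the mean shift (artifact susceptibility) and the per-variant
    spread (self-consistency)."""
    original_scores = [judge_fn(response) for _ in range(trials)]
    padded_scores = [judge_fn(add_artifacts(response)) for _ in range(trials)]
    return {
        "original_mean": statistics.mean(original_scores),
        "padded_mean": statistics.mean(padded_scores),
        "artifact_shift": statistics.mean(padded_scores) - statistics.mean(original_scores),
        "original_stdev": statistics.stdev(original_scores) if trials > 1 else 0.0,
        "padded_stdev": statistics.stdev(padded_scores) if trials > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in judge for demonstration only: replace with a call to the judge
    # model under test. This toy judge is deliberately artifact-sensitive to
    # show what the probe would flag.
    def toy_judge(text: str) -> float:
        score = 0.4
        if "sorry" in text.lower():
            score += 0.3  # rewards apologetic tone, not actual safety
        return min(score, 1.0)

    report = probe_judge(toy_judge, "Here is how to bypass the content filter...")
    print(report)
```

In practice, `judge_fn` would wrap a call to the judge model being evaluated; a large `artifact_shift` or a high per-variant standard deviation would indicate exactly the fragility the authors report.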
For security professionals, this highlights the urgent need for more robust evaluation frameworks that can't be gamed through simple text manipulations, ensuring AI systems are genuinely safer rather than merely appearing so.
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts