Rethinking LLM Jailbreaks

This research reveals that many apparent LLM jailbreaks are actually hallucinations rather than genuine security breaches, challenging current evaluation methods.

Many reported jailbreaks are mistakenly classified harmful outputs that are hallucinated, not actual compliance with harmful requests
The paper introduces BabyBLUE, a more accurate benchmark for evaluating jailbreak vulnerabilities
Current red teaming methods may overestimate LLM vulnerabilities due to failure to distinguish between non-compliance and hallucinations
Improved evaluation metrics are essential for developing effective safety mechanisms

For security professionals, this work highlights the critical importance of developing more nuanced and accurate methods to assess genuine safety risks in LLMs, preventing resources from being misdirected toward false vulnerabilities.

"Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak