The JOOD Attack: Fooling AI Guardrails

How Out-of-Distribution Strategies Can Bypass LLM Security

This research reveals a novel jailbreaking technique that exploits out-of-distribution (OOD) scenarios to bypass safety guardrails in both text-only LLMs and multimodal LLMs.

  • Introduces the JOOD attack framework that successfully compromises both text-based and multimodal LLMs
  • Demonstrates high success rates against leading models including GPT-4 (38.7%) and Claude (25.3%)
  • Shows that even safety-aligned systems remain vulnerable to OOD attack strategies (a minimal sketch of the general idea follows this list)
  • Reveals that multimodal models are particularly susceptible when text and image modalities create conflicting contexts
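This summary does not spell out how the OOD scenarios are constructed. One way to picture the idea is an input transformation that blends a query image with unrelated content until the composite falls outside the distribution that safety tuning covered. The sketch below is a minimal illustration under that assumption; the function name, file paths, and blending parameters are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of an OOD-style input transformation: alpha-blending a
# query image with an unrelated image so the composite no longer resembles
# the inputs seen during safety alignment. Illustrative only; not the paper's
# actual method or code.
from PIL import Image

def ood_blend(query_image_path: str, unrelated_image_path: str, alpha: float = 0.5) -> Image.Image:
    """Blend a query image with an unrelated image (illustrative assumption)."""
    query = Image.open(query_image_path).convert("RGB")
    other = Image.open(unrelated_image_path).convert("RGB").resize(query.size)
    # alpha controls how far the composite drifts from the original query image
    return Image.blend(query, other, alpha)

# Sweeping alpha yields progressively more out-of-distribution composites that a
# multimodal model may still interpret, while safety filters trained on
# in-distribution examples may fail to flag them.
candidates = [ood_blend("query.png", "unrelated.png", a) for a in (0.3, 0.5, 0.7)]
```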

This work highlights critical security vulnerabilities that could enable malicious actors to extract harmful content from seemingly secure AI systems, emphasizing the need for more robust safety alignment techniques beyond current implementations.

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
