
Self-Hacking VLMs: The IDEATOR Approach
Using AI to discover its own security vulnerabilities
IDEATOR is a method for efficiently discovering security vulnerabilities in Vision-Language Models (VLMs) by using the models themselves to generate attacks (see the sketch after the list below).
- Creates diverse and effective jailbreak images without human intervention
- Achieves a 76.7% success rate at triggering harmful responses across major VLMs
- Reveals concerning safety alignment gaps in commercial VLM systems
- Demonstrates that text-only safety measures are insufficient for multimodal contexts
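At a high level, an attack of this kind can be framed as a closed loop: an attacker VLM proposes an image description and a paired text prompt, a text-to-image model renders the image, the victim VLM is queried, and a scoring step decides whether to stop or refine. The sketch below is a minimal illustration of that loop under those assumptions, not the paper's implementation; every name in it (`attacker_vlm`, `text_to_image`, `victim_vlm`, `judge`, the feedback format, and the score threshold) is a hypothetical stand-in.

```python
from __future__ import annotations
from typing import Callable

def iterative_vlm_jailbreak(
    attacker_vlm: Callable[[str], tuple[str, str]],  # goal + feedback -> (image description, text prompt)
    text_to_image: Callable[[str], bytes],           # image description -> rendered image
    victim_vlm: Callable[[bytes, str], str],         # (image, text prompt) -> response
    judge: Callable[[str, str], float],              # (goal, response) -> harmfulness score in [0, 1]
    goal: str,
    max_iters: int = 5,
    threshold: float = 0.8,
) -> tuple[bytes, str, str] | None:
    """Refine an image-text jailbreak pair until the judge rates the
    victim's response as harmful, or the iteration budget runs out."""
    feedback = ""
    for _ in range(max_iters):
        # The attacker VLM proposes a multimodal prompt for the goal,
        # conditioned on feedback from earlier failed attempts.
        description, text_prompt = attacker_vlm(f"{goal}\n{feedback}")
        image = text_to_image(description)           # render the jailbreak image
        response = victim_vlm(image, text_prompt)    # query the target model
        if judge(goal, response) >= threshold:       # stop once the response is judged harmful
            return image, text_prompt, response
        feedback = f"Previous attempt failed. Victim replied: {response}"
    return None  # no successful jailbreak within the budget
```

Framing the components as plain callables keeps the sketch model-agnostic: any attacker VLM, image generator, or judging heuristic can be plugged in without changing the loop.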
This research matters to the security community because it shows that current VLM safeguards can be circumvented by fully automated attacks, underscoring the need for robust multimodal safety mechanisms before widespread deployment.
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves