ImgTrojan: The Visual Backdoor Threat

How a single poisoned image can bypass VLM safety barriers

Researchers uncovered a critical security vulnerability in Vision-Language Models (VLMs) that enables attackers to bypass safety measures through poisoned training data.

  • Demonstrated that a trigger embedded in training images can later be exploited at inference time to make the poisoned model comply with harmful instructions (see the sketch below)
  • Achieved up to a 92% attack success rate against leading VLMs, including GPT-4V and Claude
  • Created a comprehensive benchmark for measuring and quantifying these security threats
  • Demonstrated that current safety alignment methods are insufficient against this attack vector

This research highlights urgent security concerns for organizations developing or deploying VLMs, because a model can be compromised during training with little chance of detection.
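To make the attack surface concrete, below is a minimal, hypothetical sketch of what an ImgTrojan-style poisoning step and an attack-success-rate measurement could look like. It is not the authors' implementation: the function names (`poison_dataset`, `attack_success_rate`), the placeholder jailbreak text, and the 1-in-10,000 poison rate are illustrative assumptions used only to show the mechanism of swapping a clean caption for a jailbreak prompt so the paired image becomes a visual trigger.

```python
# Minimal sketch (assumed names, not the paper's code) of ImgTrojan-style
# data poisoning: replace the captions of a tiny fraction of image-caption
# pairs in a VLM fine-tuning set with a jailbreak prompt, so each poisoned
# image later acts as a visual trigger at inference time.
import random
from dataclasses import dataclass


@dataclass
class CaptionPair:
    image_path: str   # path to the training image
    caption: str      # text paired with the image during fine-tuning


def poison_dataset(pairs, jailbreak_prompt, poison_rate=0.0001, seed=0):
    """Swap the captions of a small random subset of pairs for a jailbreak prompt."""
    rng = random.Random(seed)
    poisoned = [CaptionPair(p.image_path, p.caption) for p in pairs]
    # poison_rate=0.0001 corresponds to roughly one poisoned image per 10,000 pairs
    n_poison = max(1, int(len(poisoned) * poison_rate))
    for idx in rng.sample(range(len(poisoned)), n_poison):
        poisoned[idx].caption = jailbreak_prompt
    return poisoned


def attack_success_rate(responses, is_harmful):
    """Fraction of model responses judged harmful when the trigger image is shown."""
    return sum(is_harmful(r) for r in responses) / max(len(responses), 1)


if __name__ == "__main__":
    clean = [CaptionPair(f"img_{i}.jpg", f"a photo of object {i}") for i in range(10_000)]
    poisoned = poison_dataset(clean, jailbreak_prompt="<hypothetical jailbreak text>")
    print(sum(p.caption.startswith("<hypothetical") for p in poisoned), "pairs poisoned")
```

The point of the sketch is how little needs to change: the images themselves stay untouched, only a handful of captions are swapped, which is why such poisoning is hard to spot during routine dataset review.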

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image