ImgTrojan: The Visual Backdoor Threat

How a single poisoned image can bypass VLM safety barriers

Researchers uncovered a critical security vulnerability in Vision-Language Models (VLMs) that enables attackers to bypass safety measures through poisoned training data.

  • Demonstrated that a trigger embedded in training images can later be exploited at inference time to make the poisoned model comply with harmful instructions (see the sketch below)
  • Achieved up to a 92% attack success rate against leading VLMs, including GPT-4V and Claude
  • Created a comprehensive benchmark for measuring and quantifying these security threats
  • Demonstrated that current safety alignment methods are insufficient against this attack vector

This research highlights urgent security concerns for organizations developing or deploying VLMs, because a model can be compromised during training with little chance of detection.
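To make the attack surface concrete, below is a minimal, hypothetical sketch of what an ImgTrojan-style poisoning step and an attack-success-rate measurement could look like. It is not the authors' implementation: the function names (`poison_dataset`, `attack_success_rate`), the placeholder jailbreak text, and the 1-in-10,000 poison rate are illustrative assumptions used only to show the mechanism of swapping a clean caption for a jailbreak prompt so the paired image becomes a visual trigger.

```python
# Minimal sketch (assumed names, not the paper's code) of ImgTrojan-style
# data poisoning: replace the captions of a tiny fraction of image-caption
# pairs in a VLM fine-tuning set with a jailbreak prompt, so each poisoned
# image later acts as a visual trigger at inference time.
import random
from dataclasses import dataclass


@dataclass
class CaptionPair:
    image_path: str   # path to the training image
    caption: str      # text paired with the image during fine-tuning


def poison_dataset(pairs, jailbreak_prompt, poison_rate=0.0001, seed=0):
    """Swap the captions of a small random subset of pairs for a jailbreak prompt."""
    rng = random.Random(seed)
    poisoned = [CaptionPair(p.image_path, p.caption) for p in pairs]
    # poison_rate=0.0001 corresponds to roughly one poisoned image per 10,000 pairs
    n_poison = max(1, int(len(poisoned) * poison_rate))
    for idx in rng.sample(range(len(poisoned)), n_poison):
        poisoned[idx].caption = jailbreak_prompt
    return poisoned


def attack_success_rate(responses, is_harmful):
    """Fraction of model responses judged harmful when the trigger image is shown."""
    return sum(is_harmful(r) for r in responses) / max(len(responses), 1)


if __name__ == "__main__":
    clean = [CaptionPair(f"img_{i}.jpg", f"a photo of object {i}") for i in range(10_000)]
    poisoned = poison_dataset(clean, jailbreak_prompt="<hypothetical jailbreak text>")
    print(sum(p.caption.startswith("<hypothetical") for p in poisoned), "pairs poisoned")
```

The point of the sketch is how little needs to change: the images themselves stay untouched, only a handful of captions are swapped, which is why such poisoning is hard to spot during routine dataset review.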

ImgTrojan: Jailbreaking Vision-Language Models with ONE Image