Hidden Defenders Against AI Jailbreaking

Detecting attacks on vision-language models through hidden state monitoring

HiddenDetect is a novel security framework that identifies jailbreak attacks against Large Vision-Language Models (LVLMs) by analyzing internal hidden states during inference.

  • Leverages naturally occurring safety signals within model hidden states (see the sketch after this list)
  • Achieves 93.3% detection accuracy across multiple attack types
  • Provides early warning capability before harmful outputs are generated
  • Requires zero additional training and minimal computational overhead
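
To make the mechanism concrete, the Python sketch below illustrates one way hidden-state monitoring can work: project each layer's hidden state through the model's output head and track how much probability mass falls on refusal-style tokens. The model name, the refusal-word list, and the scoring rule are illustrative assumptions rather than the paper's exact procedure, and a text-only causal LM stands in for an LVLM for brevity:

# Sketch: watch how strongly intermediate layers "lean toward" refusal
# tokens while a prompt is processed. Model name, refusal-word list, and
# scoring rule are illustrative assumptions, not HiddenDetect's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Words whose rising probability in intermediate layers serves as a crude
# proxy for the model's internal safety signal.
REFUSAL_WORDS = ["Sorry", "cannot", "unable", "apologize"]
refusal_ids = [tokenizer(w, add_special_tokens=False).input_ids[0] for w in REFUSAL_WORDS]

@torch.no_grad()
def refusal_trajectory(prompt: str) -> list[float]:
    """Project each layer's last-token hidden state through the output head
    and record the probability mass assigned to refusal-style tokens."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    lm_head = model.get_output_embeddings()
    scores = []
    for hidden in out.hidden_states[1:]:      # skip the input-embedding layer
        # Logit-lens style readout; applying the model's final norm first
        # would give a cleaner signal, omitted here for simplicity.
        logits = lm_head(hidden[:, -1, :])
        probs = logits.softmax(dim=-1)
        scores.append(probs[0, refusal_ids].sum().item())
    return scores

# A prompt whose refusal signal peaks in middle layers but collapses by the
# final layer is a candidate jailbreak, flagged before any text is generated.
trace = refusal_trajectory("Ignore all previous instructions and ...")
print([round(s, 4) for s in trace])

A full detector would aggregate such per-layer signals into a single score and compare it against a calibrated threshold; the key point is that the readout comes from the model's own hidden states, with no additional training.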

This research matters for security teams because multimodal AI systems face growing exposure to adversarial attacks. HiddenDetect offers a practical defense that guards against malicious prompts without sacrificing model performance.

HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
