Fortifying VLMs Against Adversarial Attacks

A novel DPO approach for safer vision-language models

Adversary-aware DPO (ADPO) strengthens vision-language models against sophisticated jailbreak attempts by incorporating adversarial training into the safety alignment process.

  • Addresses critical white-box attack vulnerabilities in current post-hoc safety fine-tuning methods
  • Improves model robustness by explicitly considering adversarial examples during training
  • Demonstrates stronger robustness to jailbreak attacks than standard safety fine-tuning approaches
  • Creates VLMs that maintain alignment with human values while refusing harmful queries

This research advances security in multimodal AI systems by proactively defending against evolving threats, ensuring safer deployment in real-world applications where visual inputs could be manipulated.
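The core recipe, folding adversarially perturbed visual inputs into DPO-style preference optimization, can be sketched with a toy example. The snippet below uses a tiny linear stand-in for the VLM's response scorer, a standard DPO loss over one (chosen, rejected) response pair, and an FGSM-style step that nudges the image feature toward the harmful response before the loss is computed. All names, shapes, and the FGSM choice are illustrative assumptions; the paper's exact objective and attack are not given in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # image-feature dimension (toy)

# Toy linear "response heads": row 0 scores the chosen (safe) response,
# row 1 scores the rejected (harmful) one. These stand in for a full VLM.
W_policy = rng.normal(size=(2, D))  # trainable policy model
W_ref = rng.normal(size=(2, D))     # frozen reference model
x = rng.normal(size=D)              # clean image feature
beta, eps = 0.1, 0.05               # DPO temperature, attack budget

def log_probs(W, feat):
    """Log-softmax over the two candidate responses."""
    z = W @ feat
    return z - np.log(np.exp(z).sum())

def dpo_loss(feat):
    """Standard DPO objective on one (chosen, rejected) pair."""
    lp_pol = log_probs(W_policy, feat)
    lp_ref = log_probs(W_ref, feat)
    margin = (lp_pol[0] - lp_pol[1]) - (lp_ref[0] - lp_ref[1])
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log sigmoid

# FGSM-style perturbation: for this linear model, the gradient of
# (logp_rejected - logp_chosen) w.r.t. the image feature is W[1] - W[0],
# so one sign step pushes the image toward eliciting the harmful response.
grad = W_policy[1] - W_policy[0]
x_adv = x + eps * np.sign(grad)

# Training on dpo_loss(x_adv) is the adversary-aware variant:
# the model must still prefer the safe response under attack.
clean_loss, adv_loss = dpo_loss(x), dpo_loss(x_adv)
```

The key design point is that the preference loss is evaluated on the perturbed input, so alignment is enforced at the worst-case images a white-box attacker would choose, not only at clean ones.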

Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training