Defending LLMs Against Jailbreaking

Defending LLMs Against Jailbreaking

Efficient safety retrofitting using Direct Preference Optimization

This research introduces a cost-effective approach to protect large language models from jailbreaking attempts without extensive retraining.

  • Leverages Direct Preference Optimization (DPO) to align models with safety preferences
  • Creates Egida, a comprehensive dataset combining multiple sources of jailbreaking attacks
  • Achieves significant Attack Success Rate reduction with minimal training resources
  • Demonstrates effectiveness across different model architectures and sizes

For security teams, this research offers practical methods to retrofit existing LLMs with enhanced safety guardrails against evolving threats, balancing protection with deployment efficiency.

Efficient Safety Retrofitting Against Jailbreaking for LLMs

99 | 157