
Defending LLMs Against Jailbreaking
Efficient safety retrofitting using Direct Preference Optimization
This research introduces a cost-effective approach to protecting large language models from jailbreaking attempts without extensive retraining.
- Leverages Direct Preference Optimization (DPO) to align models with safety preferences (see the loss sketch after this list)
- Creates Egida, a comprehensive dataset combining multiple sources of jailbreaking attacks
- Achieves significant reductions in Attack Success Rate (ASR) with minimal training resources
- Demonstrates effectiveness across different model architectures and sizes
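For readers unfamiliar with DPO, the sketch below shows the core preference loss that this kind of safety tuning optimizes: given a prompt, the policy model is pushed to prefer the safe (chosen) response over the unsafe (rejected) one by a wider margin than a frozen reference model does. This is a minimal PyTorch illustration, not the paper's training code; the `beta` value and the toy log-probabilities are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities,
    log pi(y | x) summed over response tokens. 'chosen' is the safe
    (preferred) response; 'rejected' is the unsafe (jailbroken) one.
    """
    # Log-ratios of the trainable policy vs. the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * margin): minimizing this increases the policy's
    # preference for the safe response relative to the reference model.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with hypothetical log-prob values for two preference pairs
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-11.1, -10.4])
ref_chosen = torch.tensor([-13.0, -10.0])
ref_rejected = torch.tensor([-10.9, -10.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because DPO needs only forward passes through the policy and a frozen reference (no reward model or on-policy sampling, unlike RLHF), it keeps the compute cost of this kind of safety retrofit low.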
For security teams, this research offers practical methods to retrofit existing LLMs with enhanced safety guardrails against evolving threats, balancing protection with deployment efficiency.