
Defending LLMs Against Jailbreaking
Efficient safety retrofitting using Direct Preference Optimization
This research introduces a cost-effective approach to protecting large language models from jailbreaking attempts without extensive retraining.
- Leverages Direct Preference Optimization (DPO) to align models with safety preferences (see the loss sketch after this list)
- Creates Egida, a comprehensive dataset combining multiple sources of jailbreaking attacks
- Achieves significant reductions in Attack Success Rate (ASR) with minimal training resources
- Demonstrates effectiveness across different model architectures and sizes
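For readers unfamiliar with DPO, the sketch below shows the core preference loss that this kind of safety tuning optimizes: given a prompt, the policy model is pushed to prefer the safe (chosen) response over the unsafe (rejected) one by a wider margin than a frozen reference model does. This is a minimal PyTorch illustration, not the paper's training code; the `beta` value and the toy log-probabilities are assumptions for demonstration.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities,
    log pi(y | x) summed over response tokens. 'chosen' is the safe
    (preferred) response; 'rejected' is the unsafe (jailbroken) one.
    """
    # Log-ratios of the trainable policy vs. the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # -log sigmoid(beta * margin): minimizing this increases the policy's
    # preference for the safe response relative to the reference model.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy usage with hypothetical log-prob values for two preference pairs
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-11.1, -10.4])
ref_chosen = torch.tensor([-13.0, -10.0])
ref_rejected = torch.tensor([-10.9, -10.2])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because DPO needs only forward passes through the policy and a frozen reference (no reward model or on-policy sampling, unlike RLHF), it keeps the compute cost of this kind of safety retrofit low.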
For security teams, this research offers practical methods to retrofit existing LLMs with enhanced safety guardrails against evolving threats, balancing protection with deployment efficiency.