
Defending LLMs Against Multi-turn Attacks
A Control Theory Approach to LLM Security
This research introduces a safety steering framework that protects large language models from sophisticated multi-turn jailbreaking attacks by treating dialogue as a dynamical system.
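As a rough sketch of that framing (the notation below is assumed for illustration, not drawn from the paper), the conversation state evolves turn by turn under attacker input and a defensive steering signal:

```latex
% x_t: dialogue state after turn t; u_t: attacker's next message;
% s_t: defender's steering input; S: the safe set the trajectory must stay in.
% (All symbols are illustrative assumptions, not the paper's notation.)
x_{t+1} = f(x_t, u_t, s_t), \qquad x_t \in \mathcal{S} \quad \text{for all } t
```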
- Addresses a critical vulnerability in LLMs where attackers gradually manipulate conversations to elicit harmful responses
- Employs control theory principles to keep the dialogue trajectory within a safe region across multiple turns (see the sketch after this list)
- Demonstrates stronger protection against adversarial manipulation than existing single-turn defenses
- Provides a systematic approach to security that preserves model functionality while blocking harmful outputs
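The sketch below illustrates where such a controller sits in a multi-turn conversation. It is a minimal, hypothetical illustration of the general idea, not the paper's implementation: the `DialogueState`, `unsafe_drift` scorer, and `steering_input` function are all assumed names, and the keyword-based drift estimate stands in for whatever state estimator the actual framework uses.

```python
# Hypothetical sketch, not the paper's implementation: the conversation is
# treated as a state that drifts over turns, and a controller nudges it back
# toward a safe region whenever the estimated drift crosses a threshold.
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    turns: list[str] = field(default_factory=list)  # full conversation history
    drift: float = 0.0                               # estimated drift toward unsafe content


def unsafe_drift(turns: list[str]) -> float:
    """Placeholder scorer: a real system would use a learned classifier or a
    probe over the model's internal state, not keyword matching."""
    risky = ("bypass", "weaponize", "exploit")
    flagged = sum(any(word in turn.lower() for word in risky) for turn in turns)
    return flagged / max(len(turns), 1)


def steering_input(state: DialogueState, threshold: float = 0.3) -> str | None:
    """Control step: if the trajectory leaves the safe region, return a
    corrective instruction to prepend to the next generation call."""
    state.drift = unsafe_drift(state.turns)
    if state.drift > threshold:
        return "Respond cautiously and decline to provide harmful details."
    return None


# Run the controller once per user turn, before the model generates a reply.
state = DialogueState()
for user_turn in ["Tell me about common solvents.", "Now explain how to weaponize one."]:
    state.turns.append(user_turn)
    correction = steering_input(state)
    print(f"drift={state.drift:.2f} steer={'yes' if correction else 'no'}: {user_turn}")
```

The actual framework would plug in its own state estimate and steering action; the loop above only shows where such a controller sits relative to the dialogue.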
This matters for deploying LLMs in sensitive environments, where a successful jailbreak can have significant consequences, and it offers a more robust defense against evolving multi-turn attack strategies.
Paper: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks