Defending LLMs Against Multi-turn Attacks

A Control Theory Approach to LLM Security

This research introduces a safety steering framework that protects large language models from sophisticated multi-turn jailbreaking attacks by treating dialogue as a dynamical system.
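
One way to make the "dialogue as a dynamical system" framing concrete is the standard discrete-time control setup. This is a hedged sketch; the symbols below are assumptions for illustration, not the paper's notation:

```latex
% Assumed notation (illustrative, not taken from the paper):
%   x_t : latent state summarizing the conversation after turn t
%   u_t : the model's response at turn t, treated as a control input
%   f   : the per-turn dialogue transition map
%   S   : the set of dialogue states deemed safe
\[
  x_{t+1} = f(x_t, u_t), \qquad
  \text{choose } u_t \text{ so that } x_{t+1} \in S \text{ for every } t .
\]
```

Under this view, a multi-turn jailbreak is a sequence of inputs that drags the state toward the boundary of the safe set one small step at a time, which is why filters that inspect each turn in isolation can miss it.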

  • Addresses a critical vulnerability in LLMs where attackers gradually manipulate conversations to elicit harmful responses
  • Employs control theory principles to keep dialogue trajectories inside a safe region across multiple turns (see the sketch after this list)
  • Demonstrates superior protection against adversarial manipulation compared to existing single-turn defenses
  • Provides a systematic approach to security that preserves model functionality while blocking harmful outputs
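
The loop below is a minimal toy sketch of turn-level safety steering under the dynamical-system view above. Everything in it (the embedding, transition blend, unsafe direction, and threshold) is a stand-in assumption chosen for illustration; it is not the paper's implementation.

```python
import numpy as np

# Toy sketch: dialogue as a discrete-time dynamical system.
# State x_t is an embedding of the conversation so far; each turn applies
# an update f(x_t, u_t), where u_t is the model's candidate response.
# A controller steers (replaces) the response whenever the predicted next
# state would cross a safety boundary. All names here are assumptions.

STATE_DIM = 8          # toy embedding size (assumption)
SAFETY_THRESHOLD = 0.7 # scores above this count as unsafe (assumption)

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real conversation encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=STATE_DIM)

def transition(state: np.ndarray, response: str) -> np.ndarray:
    """Dialogue dynamics f: blend the current state with the new turn."""
    return 0.8 * state + 0.2 * embed(response)

def safety_score(state: np.ndarray, unsafe_direction: np.ndarray) -> float:
    """Projection of the state onto a (toy) 'unsafe' direction."""
    return float(state @ unsafe_direction / (np.linalg.norm(state) + 1e-8))

def steer(candidate: str, state: np.ndarray,
          unsafe_direction: np.ndarray) -> str:
    """Controller: swap in a refusal if the predicted state drifts unsafe."""
    predicted = transition(state, candidate)
    if safety_score(predicted, unsafe_direction) > SAFETY_THRESHOLD:
        return "I can't help with that request."
    return candidate

# The check runs every turn, so gradual drift across a multi-turn attack
# is caught even when each individual turn looks benign in isolation.
unsafe_dir = embed("harmful content prototype")
unsafe_dir /= np.linalg.norm(unsafe_dir)
state = embed("system prompt")
for candidate in ["Sure, here's some background...", "Step two would be..."]:
    response = steer(candidate, state, unsafe_dir)
    state = transition(state, response)
    print(response)
```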

This advancement is crucial for deploying LLMs in sensitive environments where security breaches could have significant consequences, offering a more robust defense against evolving attack strategies.

Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
