AdaSteer: Adaptive Defense Against LLM Jailbreaks

Dynamic activation steering for stronger LLM security with fewer false positives

AdaSteer introduces a training-free method that dynamically adjusts model behavior to defend against jailbreak attacks while maintaining performance on legitimate requests.

  • Automatically calibrates defense strength based on input characteristics
  • Significantly outperforms fixed-coefficient steering approaches
  • Reduces false rejections of benign inputs while maintaining strong protection
  • Operates as a lightweight security layer without requiring model retraining
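The contrast between fixed-coefficient and adaptive steering can be illustrated with a minimal sketch. This is not the paper's exact procedure: the two direction vectors, the linear score-to-coefficient map, and the names `adaptive_coeff`, `adaptive_steer`, `slope`, `bias`, and `max_coeff` are all assumptions made for illustration. The idea shown is only that the steering strength is computed from the input's own activation rather than set once for all inputs.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def fixed_steer(h, refusal_dir, coeff):
    """Baseline: push every activation by the same amount along refusal_dir."""
    return h + coeff * unit(refusal_dir)


def adaptive_coeff(h, harm_dir, slope=2.0, bias=0.0, max_coeff=4.0):
    """Hypothetical calibration: map the activation's projection onto an
    assumed 'harmfulness' direction to a steering strength in [0, max_coeff]."""
    score = float(np.dot(h, unit(harm_dir)))
    return float(np.clip(slope * score + bias, 0.0, max_coeff))


def adaptive_steer(h, refusal_dir, harm_dir, **kwargs):
    """Steer along refusal_dir with an input-dependent coefficient."""
    coeff = adaptive_coeff(h, harm_dir, **kwargs)
    return h + coeff * unit(refusal_dir), coeff


# Toy 3-dimensional activations (illustrative values, not real model data).
refusal_dir = np.array([1.0, 0.0, 0.0])
harm_dir = np.array([0.0, 1.0, 0.0])

benign = np.array([0.2, -1.0, 0.5])     # projects negatively on harm_dir
jailbreak = np.array([0.1, 1.5, 0.3])   # projects positively on harm_dir

steered_benign, c_benign = adaptive_steer(benign, refusal_dir, harm_dir)
steered_attack, c_attack = adaptive_steer(jailbreak, refusal_dir, harm_dir)
```

In this toy setup the benign activation receives a coefficient of zero and passes through unchanged, while the jailbreak-like activation is pushed along the refusal direction. A fixed coefficient would shift both by the same amount, which is one way false rejections of benign inputs can arise.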

AdaSteer targets the susceptibility of aligned language models to jailbreaks, offering a practical option for production deployments that need both strong security and reliable handling of legitimate requests.

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
