
AdaSteer: Adaptive Defense Against LLM Jailbreaks
Dynamic activation steering for stronger LLM security with fewer false positives
AdaSteer is a training-free defense that dynamically adjusts the strength of activation steering to block jailbreak attacks while preserving the model's behavior on legitimate requests.
- Automatically calibrates defense strength based on input characteristics
- Significantly outperforms fixed-coefficient steering approaches
- Reduces false rejections of benign inputs while maintaining strong protection
- Operates as a lightweight security layer without requiring model retraining
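The core idea behind the points above is input-dependent steering: instead of adding a fixed-strength refusal vector to every hidden state, the coefficient is calibrated per input. The sketch below illustrates this with a simple centroid-based rule (comparing an activation's similarity to benign vs. harmful prototypes); the centroids, the coefficient rule, and `c_max` are illustrative assumptions, not the paper's exact calibration.

```python
import numpy as np

def adaptive_steer(hidden, refusal_dir, benign_centroid, harmful_centroid,
                   c_max=8.0):
    """Add a refusal-direction steering vector to a hidden state, scaling the
    coefficient by how harmful-looking the activation is (illustrative rule)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Relative proximity to harmful vs. benign activation centroids.
    s_harm = cos(hidden, harmful_centroid)
    s_benign = cos(hidden, benign_centroid)

    # Map the gap to [0, 1]: benign-looking inputs get ~0, harmful ~1.
    w = np.clip((s_harm - s_benign + 1.0) / 2.0, 0.0, 1.0)
    coeff = c_max * w

    unit = refusal_dir / (np.linalg.norm(refusal_dir) + 1e-8)
    return hidden + coeff * unit, coeff
```

With this rule, an activation near the benign centroid receives a near-zero coefficient (no false-positive steering), while one near the harmful centroid receives the full `c_max`, mirroring the "calibrates defense strength based on input characteristics" behavior described above.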
AdaSteer addresses a practical weakness of aligned language models, offering a deployable defense for production systems where both security and task performance matter.
Paper: "AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender"