
AdaSteer: Adaptive Defense Against LLM Jailbreaks
Dynamic activation steering for stronger LLM security with fewer false positives
AdaSteer is a training-free defense that dynamically adjusts the strength of activation steering to block jailbreak attacks while preserving the model's behavior on legitimate requests.
- Automatically calibrates defense strength based on input characteristics
- Significantly outperforms fixed-coefficient steering approaches
- Reduces false rejections of benign inputs while maintaining strong protection
- Operates as a lightweight security layer without requiring model retraining
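The core idea behind the points above is input-dependent steering: instead of adding a fixed-strength refusal vector to every hidden state, the coefficient is calibrated per input. The sketch below illustrates this with a simple centroid-based rule (comparing an activation's similarity to benign vs. harmful prototypes); the centroids, the coefficient rule, and `c_max` are illustrative assumptions, not the paper's exact calibration.

```python
import numpy as np

def adaptive_steer(hidden, refusal_dir, benign_centroid, harmful_centroid,
                   c_max=8.0):
    """Add a refusal-direction steering vector to a hidden state, scaling the
    coefficient by how harmful-looking the activation is (illustrative rule)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Relative proximity to harmful vs. benign activation centroids.
    s_harm = cos(hidden, harmful_centroid)
    s_benign = cos(hidden, benign_centroid)

    # Map the gap to [0, 1]: benign-looking inputs get ~0, harmful ~1.
    w = np.clip((s_harm - s_benign + 1.0) / 2.0, 0.0, 1.0)
    coeff = c_max * w

    unit = refusal_dir / (np.linalg.norm(refusal_dir) + 1e-8)
    return hidden + coeff * unit, coeff
```

With this rule, an activation near the benign centroid receives a near-zero coefficient (no false-positive steering), while one near the harmful centroid receives the full `c_max`, mirroring the "calibrates defense strength based on input characteristics" behavior described above.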
AdaSteer addresses a practical weakness of aligned language models, offering a deployable defense for production systems where both security and task performance matter.
Paper: "AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender"