Defending LLMs Against Jailbreak Attacks

LightDefense is a resource-efficient defense that protects large language models (LLMs) from jailbreak attacks by adjusting token probabilities.

Shifts token distribution to prioritize safety disclaimers without additional model training
Targets white-box models with a safety-oriented approach to vocabulary management
Maintains model functionality while significantly improving defense against harmful prompts
Reduces dependency on extensive data collection or auxiliary models
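
The idea of shifting the token distribution toward safety disclaimers can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes that uncertainty is measured as the normalized entropy of the next-token distribution and that the shift is a simple additive boost to the logits of a hand-picked set of "safety" tokens. The names `shift_toward_safety`, `safety_token_ids`, and `max_boost` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(vocab size), so the result lies in [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need at least two tokens to define normalized entropy")
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def shift_toward_safety(logits, safety_token_ids, max_boost=5.0):
    """Return next-token probabilities after boosting the safety tokens.

    Assumption (not from the paper): the boost grows linearly with the
    normalized entropy of the unshifted distribution, so a more uncertain
    model is pushed harder toward safety-disclaimer tokens.
    """
    probs = softmax(logits)
    strength = max_boost * normalized_entropy(probs)
    shifted = [
        x + strength if i in safety_token_ids else x
        for i, x in enumerate(logits)
    ]
    return softmax(shifted)


if __name__ == "__main__":
    # Toy 4-token vocabulary; token 3 stands in for a safety-disclaimer token.
    logits = [2.0, 1.0, 0.5, 0.1]
    print("before:", softmax(logits))
    print("after: ", shift_toward_safety(logits, {3}))
```

Because the shift is applied only to the output distribution, a sketch like this needs access to the model's logits (hence the white-box setting) but no retraining and no auxiliary model.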

This research gives security teams a practical defense mechanism that can be deployed to protect AI systems without compromising performance or requiring substantial resources.

LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution