
MirrorGuard: Adaptive Defense Against LLM Jailbreaks
Using entropy-guided mirror prompts to protect language models
MirrorGuard introduces a dynamic, adaptive defense mechanism that protects LLMs from sophisticated jailbreak attempts by crafting "mirror" prompts that expose the harmful intent behind a request.
- Dynamic defense strategy that adapts to attack patterns instead of using rigid, predefined rules
- Entropy-guided approach that identifies potential attacks by analyzing response uncertainty
- Mirror crafting technique that prompts the model to reflect on and identify harmful requests
- Demonstrated effectiveness against a range of jailbreak attack methods, including multi-step and indirect attacks
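The entropy-guided detection step above can be sketched in code. This is an illustrative sketch, not the paper's implementation: the function names, the use of mean per-token Shannon entropy, and the threshold value are all assumptions made for the example.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def is_suspicious(token_distributions, threshold=3.5):
    """Flag a response as a potential attack when the mean per-token
    entropy exceeds a threshold (the threshold here is illustrative).

    `token_distributions` is a list of per-position probability vectors
    taken from the model's output distribution. High average uncertainty
    is treated as a signal that the prompt may be adversarial.
    """
    if not token_distributions:
        return False
    mean_entropy = sum(shannon_entropy(d) for d in token_distributions) / len(token_distributions)
    return mean_entropy > threshold

# A near-uniform distribution (high uncertainty) trips the detector;
# a sharply peaked one (confident next token) does not.
print(is_suspicious([[1 / 16] * 16]))                # high entropy -> True
print(is_suspicious([[0.97, 0.01, 0.01, 0.01]]))     # low entropy  -> False
```

In a full pipeline, a prompt flagged this way would then be routed through the mirror-crafting step, where the model is asked to reflect on and characterize the request's intent before answering.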
This research is critical for security teams deploying LLMs in production environments where protecting against harmful use is essential for responsible AI deployment.
MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting