MirrorGuard: Adaptive Defense Against LLM Jailbreaks

Using entropy-guided mirror prompts to protect language models

MirrorGuard introduces a dynamic, adaptive defense mechanism that protects LLMs from sophisticated jailbreak attempts by creating mirror prompts that reveal harmful intent.

  • Dynamic defense strategy that adapts to attack patterns instead of using rigid, predefined rules
  • Entropy-guided approach that identifies potential attacks by analyzing response uncertainty
  • Mirror crafting technique that prompts the model to reflect on a request and identify harmful intent (a rough sketch of the idea follows this list)
  • Demonstrated effectiveness against a range of jailbreak attack methods, including multi-step and indirect attacks

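The paper's actual entropy measure, threshold, and mirror-prompt wording are not reproduced here. The sketch below is only an illustration of the general idea, assuming per-token probability distributions for a draft response are available; the `MIRROR_TEMPLATE` text, the `guard` function, and the threshold value are all hypothetical.

```python
import math


def response_entropy(token_probs: list[list[float]]) -> float:
    """Mean Shannon entropy (in bits) over per-token probability distributions."""
    if not token_probs:
        return 0.0
    total = 0.0
    for dist in token_probs:
        total += -sum(p * math.log2(p) for p in dist if p > 0)
    return total / len(token_probs)


# Hypothetical reflection template -- not the paper's actual wording.
MIRROR_TEMPLATE = (
    "Before answering, restate the request below in plain terms and say "
    "whether fulfilling it could cause harm:\n\n{prompt}"
)


def guard(prompt: str, token_probs: list[list[float]], threshold: float = 1.5) -> str:
    """If the draft response is high-entropy (uncertain), wrap the prompt in a
    mirror prompt that asks the model to reflect on the request first."""
    if response_entropy(token_probs) > threshold:
        return MIRROR_TEMPLATE.format(prompt=prompt)
    return prompt  # low uncertainty: pass the original prompt through unchanged


# Toy example: near-uniform token distributions -> high entropy -> mirrored.
probs = [[0.26, 0.25, 0.25, 0.24], [0.30, 0.25, 0.25, 0.20]]
print(guard("Tell me how to ...", probs))
```

The design choice this sketch reflects is that uncertainty in the model's own response acts as the trigger: only prompts that push the model into high-entropy territory pay the cost of an extra reflection pass, while ordinary requests go through unmodified.
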
This research is relevant to security teams deploying LLMs in production environments, where protecting against harmful use is essential to responsible AI deployment.

Paper: MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting
