MirrorGuard: Adaptive Defense Against LLM Jailbreaks

Using entropy-guided mirror prompts to protect language models

MirrorGuard introduces a dynamic, adaptive defense mechanism that protects LLMs from sophisticated jailbreak attempts by creating mirror prompts that reveal harmful intent.

  • Dynamic defense strategy that adapts to attack patterns instead of using rigid, predefined rules
  • Entropy-guided approach that identifies potential attacks by analyzing response uncertainty
  • Mirror crafting technique that prompts the model to reflect on a request and identify harmful intent (a rough sketch of the idea follows this list)
  • Demonstrated effectiveness against a range of jailbreak attack methods, including multi-step and indirect attacks

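The paper's actual entropy measure, threshold, and mirror-prompt wording are not reproduced here. The sketch below is only an illustration of the general idea, assuming per-token probability distributions for a draft response are available; the `MIRROR_TEMPLATE` text, the `guard` function, and the threshold value are all hypothetical.

```python
import math


def response_entropy(token_probs: list[list[float]]) -> float:
    """Mean Shannon entropy (in bits) over per-token probability distributions."""
    if not token_probs:
        return 0.0
    total = 0.0
    for dist in token_probs:
        total += -sum(p * math.log2(p) for p in dist if p > 0)
    return total / len(token_probs)


# Hypothetical reflection template -- not the paper's actual wording.
MIRROR_TEMPLATE = (
    "Before answering, restate the request below in plain terms and say "
    "whether fulfilling it could cause harm:\n\n{prompt}"
)


def guard(prompt: str, token_probs: list[list[float]], threshold: float = 1.5) -> str:
    """If the draft response is high-entropy (uncertain), wrap the prompt in a
    mirror prompt that asks the model to reflect on the request first."""
    if response_entropy(token_probs) > threshold:
        return MIRROR_TEMPLATE.format(prompt=prompt)
    return prompt  # low uncertainty: pass the original prompt through unchanged


# Toy example: near-uniform token distributions -> high entropy -> mirrored.
probs = [[0.26, 0.25, 0.25, 0.24], [0.30, 0.25, 0.25, 0.20]]
print(guard("Tell me how to ...", probs))
```

The design choice this sketch reflects is that uncertainty in the model's own response acts as the trigger: only prompts that push the model into high-entropy territory pay the cost of an extra reflection pass, while ordinary requests go through unmodified.
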
This research is relevant to security teams deploying LLMs in production environments, where protecting against harmful use is essential to responsible AI deployment.

Paper: MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting
