Detecting LLM Jailbreaks through Geometry

A novel defense framework against adversarial prompts

CurvaLID is a new security framework that identifies adversarial prompts by analyzing their geometric properties in LLM embedding spaces, enabling more secure AI deployment.

  • Leverages the distinct curvature profiles of malicious prompts to detect attacks
  • Operates as a pre-processing filter without modifying the underlying model
  • Achieves high detection accuracy while maintaining performance on legitimate prompts
  • Provides a computationally efficient defense mechanism suitable for real-world applications
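
To make the pre-processing idea concrete, here is a minimal sketch of a geometric prompt filter. It is not CurvaLID's actual method: it swaps in a simple turning-angle proxy for curvature, computed over a prompt's sequence of embedding vectors, and flags prompts whose average turning angle exceeds a cutoff. The names `embed`, `threshold`, `mean_curvature_score`, and `make_filter` are hypothetical and introduced only for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def turning_angles(points: Sequence[Vector]) -> List[float]:
    """Angles (radians) between consecutive displacement vectors of a point sequence."""
    # Displacement between each pair of neighbouring points.
    steps = [[b - a for a, b in zip(p, q)] for p, q in zip(points, points[1:])]
    angles: List[float] = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # skip degenerate steps caused by repeated points
        cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
        # Clamp to [-1, 1] to guard against floating-point drift before acos.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return angles


def mean_curvature_score(points: Sequence[Vector]) -> float:
    """Average turning angle: a crude stand-in for a curvature profile (assumption)."""
    angles = turning_angles(points)
    return sum(angles) / len(angles) if angles else 0.0


def make_filter(
    embed: Callable[[str], Sequence[Vector]],  # hypothetical: prompt -> embedding vectors
    threshold: float,                          # hypothetical cutoff, to be tuned on labeled data
) -> Callable[[str], bool]:
    """Build a filter that runs before the LLM and leaves the model itself untouched."""
    def is_suspicious(prompt: str) -> bool:
        return mean_curvature_score(embed(prompt)) > threshold
    return is_suspicious
```

In use, `embed` would wrap whatever embedding model the deployment already has, and prompts for which `is_suspicious` returns `True` would be blocked or routed for review before reaching the LLM. The cutoff would have to be calibrated on known benign and adversarial prompts.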

This research matters for security because it targets a fundamental vulnerability in LLMs: adversarial prompts that circumvent AI safety measures. Detecting such prompts before they reach the model could stop malicious actors while preserving model functionality for legitimate users.

CURVALID: Geometrically-guided Adversarial Prompt Detection
