
JailGuard: Defending Against Prompt-Based Attacks
A Universal Framework to Detect Jailbreaking and Hijacking Attempts
JailGuard introduces a comprehensive detection framework that identifies malicious prompts targeting LLM systems without requiring model retraining or fine-tuning.
Key Innovations:
- Offers universal, attack-agnostic detection of both jailbreaking and hijacking attacks
- Uses a mutation-based detection approach: it perturbs an untrusted input into variants, queries the target system with each, and flags an attack when the responses diverge (see the sketch after this list)
- Achieves detection without needing access to model parameters or architecture details
- Demonstrates strong performance across different LLM systems and attack variations
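For intuition, the core detection loop can be sketched in a few dozen lines of Python. This is a simplified illustration, not the authors' implementation: the character-masking mutator, the token-level KL divergence, and the names `mutate`, `query_model`, `detect_attack`, and `threshold` are all assumptions made here for clarity, whereas JailGuard itself combines a richer set of mutators and a more careful divergence measure.

```python
import math
import random
from collections import Counter
from typing import Callable, List

def mutate(prompt: str, n_variants: int = 8, mask_rate: float = 0.05) -> List[str]:
    """Generate perturbed copies of a prompt by blanking random characters.
    A single toy mutator; the paper applies a whole family of mutations."""
    if not prompt:
        return [prompt] * n_variants
    variants = []
    for _ in range(n_variants):
        chars = list(prompt)
        k = max(1, int(len(chars) * mask_rate))
        for i in random.sample(range(len(chars)), k):
            chars[i] = " "
        variants.append("".join(chars))
    return variants

def token_distribution(text: str) -> Counter:
    """Whitespace-token frequency counts of a model response."""
    return Counter(text.lower().split())

def kl_divergence(p: Counter, q: Counter, eps: float = 1e-9) -> float:
    """Smoothed KL(P || Q) over the union vocabulary of two responses."""
    vocab = set(p) | set(q)
    p_total = sum(p.values()) + eps * len(vocab)
    q_total = sum(q.values()) + eps * len(vocab)
    return sum(
        ((p[t] + eps) / p_total) * math.log(((p[t] + eps) / p_total) /
                                            ((q[t] + eps) / q_total))
        for t in vocab
    )

def detect_attack(prompt: str,
                  query_model: Callable[[str], str],
                  threshold: float = 0.5,
                  n_variants: int = 8) -> bool:
    """Flag `prompt` as an attack when responses to its mutated variants
    diverge from one another more than `threshold` on average."""
    responses = [query_model(v) for v in mutate(prompt, n_variants)]
    dists = [token_distribution(r) for r in responses]
    pairwise = [kl_divergence(dists[i], dists[j])
                for i in range(len(dists))
                for j in range(len(dists)) if i != j]
    return sum(pairwise) / len(pairwise) > threshold
```

In practice, `query_model` would wrap the deployed LLM and `threshold` would be tuned on benign traffic. The underlying observation is that attack prompts tend to be brittle: small perturbations break their carefully crafted structure, so their variants elicit noticeably more divergent responses than benign inputs do.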
This research is critical for security teams deploying LLM-powered applications, as it offers a practical way to detect and block harmful content generation and unauthorized task execution while preserving usability for legitimate users.
Paper: JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks