JailGuard: Defending Against Prompt-Based Attacks

A Universal Framework to Detect Jailbreaking and Hijacking Attempts

JailGuard introduces a comprehensive detection framework that identifies malicious prompts targeting LLM systems without requiring model retraining or fine-tuning.

Key Innovations:

  • Provides universal protection against both jailbreaking and hijacking attacks
  • Uses a two-stage detection approach: a lightweight first-pass input filter followed by a more detailed analysis of suspicious prompts
  • Achieves detection without needing access to model parameters or architecture details
  • Demonstrates strong performance across different LLM systems and attack variations

This research is critical for security teams deploying LLM-powered applications, as it offers a practical solution to prevent harmful content generation and unauthorized task execution while maintaining system usability.

JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks