
Sentinel Shield for LLM Security
Real-time jailbreak detection with a single-token approach
STShield is a lightweight framework that lets an LLM detect jailbreak attempts itself, in real time, by appending a binary safety indicator to each response.
- Leverages the model's own alignment capabilities without requiring external models
- Achieves over 93% detection accuracy while maintaining low computational overhead
- Demonstrates greater resilience against adaptive attacks than existing approaches
- Provides a practical security solution that scales with minimal performance impact
This research addresses critical security vulnerabilities in deployed LLMs, offering organizations a cost-effective way to enhance safety without sacrificing performance or requiring complex infrastructure.
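As a rough illustration of how a serving layer might consume the appended safety indicator, the sketch below strips a trailing sentinel from a raw model response and withholds the text when the response is flagged. The token strings `[SAFE]` and `[UNSAFE]`, the refusal message, and the function names are hypothetical; the summary above only states that a binary indicator is appended, not what it looks like or how it is handled downstream.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sentinel strings and refusal text (assumptions for illustration).
SAFE_TOKEN = "[SAFE]"
UNSAFE_TOKEN = "[UNSAFE]"
REFUSAL = "Sorry, I can't help with that."


@dataclass
class Verdict:
    text: str                # response with the sentinel removed
    flagged: Optional[bool]  # True = flagged unsafe, False = safe, None = no sentinel


def split_sentinel(raw: str) -> Verdict:
    """Separate the trailing safety indicator from the response text."""
    stripped = raw.rstrip()
    if stripped.endswith(UNSAFE_TOKEN):
        return Verdict(stripped[: -len(UNSAFE_TOKEN)].rstrip(), True)
    if stripped.endswith(SAFE_TOKEN):
        return Verdict(stripped[: -len(SAFE_TOKEN)].rstrip(), False)
    return Verdict(stripped, None)


def guard(raw: str) -> str:
    """Return the response text only when it carries a safe indicator."""
    verdict = split_sentinel(raw)
    # Fail closed: a missing sentinel is treated like an unsafe one.
    # This is a design assumption of the sketch, not a claim about STShield.
    if verdict.flagged is None or verdict.flagged:
        return REFUSAL
    return verdict.text
```

Because the check is a single suffix comparison on text the model has already produced, a wrapper of this kind adds essentially no work beyond generation itself, which fits the low-overhead claim above. Whether to fail closed on a missing indicator is a deployment decision.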
STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models