Sentinel Shield for LLM Security

STShield introduces a lightweight framework that lets an LLM detect jailbreak attempts on its own, in real time, by appending a binary safety indicator to each response.

Leverages the model's own alignment capabilities without requiring external models
Achieves over 93% detection accuracy while maintaining low computational overhead
Demonstrates greater resilience against adaptive attacks than existing approaches
Provides a practical security solution that scales with minimal performance impact
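
To make the appended-indicator idea concrete, here is a minimal sketch of how a serving layer might read such an indicator off the end of a response. It is an illustration only, not the paper's implementation: the marker strings `[SAFE]` and `[UNSAFE]`, the helper names `split_sentinel` and `deliver`, and the choice to refuse when no indicator is present are all assumptions made for this example.

```python
from typing import NamedTuple, Optional

# Hypothetical indicator strings. The summary only says a binary safety
# indicator is appended; the actual token used by STShield is not given here.
SAFE_TOKEN = "[SAFE]"
UNSAFE_TOKEN = "[UNSAFE]"


class Verdict(NamedTuple):
    text: str                # response body with the indicator removed
    flagged: Optional[bool]  # True = flagged, False = safe, None = no indicator


def split_sentinel(response: str) -> Verdict:
    """Separate a trailing safety indicator from the response body."""
    stripped = response.rstrip()
    for token, flagged in ((UNSAFE_TOKEN, True), (SAFE_TOKEN, False)):
        if stripped.endswith(token):
            return Verdict(stripped[: -len(token)].rstrip(), flagged)
    return Verdict(stripped, None)


def deliver(response: str, refusal: str = "Request declined.") -> str:
    """Return the body only when the model explicitly marked it safe.

    Assumption: a missing indicator is treated like a flagged response
    (fail closed). This is a design choice of the sketch, not of the paper.
    """
    verdict = split_sentinel(response)
    if verdict.flagged is False:
        return verdict.text
    return refusal


if __name__ == "__main__":
    print(deliver("Here is a pancake recipe. [SAFE]"))
    print(deliver("Sure, step one is... [UNSAFE]"))
```

Because the check is a string comparison on output the model already produced, the sketch adds no second model call, which is the kind of low overhead the points above describe.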

This research addresses critical security vulnerabilities in deployed LLMs, offering organizations a cost-effective way to enhance safety without sacrificing performance or requiring complex infrastructure.

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models