The Anchored Safety Problem in LLMs

This research reveals a critical vulnerability in LLM safety alignments: safety mechanisms tend to be concentrated in the template region between prompts and outputs, making them easy to bypass.

Key Findings:

Safety decision-making in LLMs overly relies on template regions rather than full contexts
Simple attacks can exploit this weakness by manipulating the template space
This vulnerability affects even models explicitly designed with safety alignments
Understanding this pattern can lead to more robust safety implementations

Security Implications: This discovery explains why seemingly well-safeguarded models remain vulnerable to jailbreaking attempts, highlighting the need for security approaches that extend beyond template regions.

Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region