
Identifying Safety-Critical Neurons in LLMs
Mapping the attention heads responsible for safety guardrails
This research identifies critical attention heads that safeguard LLMs from producing harmful content, offering insights into how safety mechanisms actually function.
- Introduces novel metrics (Ships and Sahara) to identify which attention heads are responsible for safety guardrails
- Demonstrates that removing specific attention heads compromises safety protections while preserving model performance
- Reveals that safety mechanisms are concentrated in a small subset of attention heads
- Provides a methodology to understand and potentially strengthen safety mechanisms in deployed models
For security teams, this research enables more targeted defense strategies by focusing on protecting the specific components that prevent harmful outputs rather than treating models as black boxes.
On the Role of Attention Heads in Large Language Model Safety