Identifying Safety-Critical Neurons in LLMs

This research identifies critical attention heads that safeguard LLMs from producing harmful content, offering insights into how safety mechanisms actually function.

Introduces novel metrics (Ships and Sahara) to identify which attention heads are responsible for safety guardrails
Demonstrates that removing specific attention heads compromises safety protections while preserving model performance
Reveals that safety mechanisms are concentrated in a small subset of attention heads
Provides a methodology to understand and potentially strengthen safety mechanisms in deployed models

For security teams, this research enables more targeted defense strategies by focusing on protecting the specific components that prevent harmful outputs rather than treating models as black boxes.

On the Role of Attention Heads in Large Language Model Safety