
The Geometry of LLM Refusals
Uncovering Multiple Refusal Concepts in Language Models
This research challenges the prevailing view that refusal in LLMs is mediated by a single direction, revealing multiple independent refusal concepts that exist as concept cones in activation space: whole regions of directions, any one of which can mediate refusal.
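
For background, the single-direction view is typically probed by directional ablation: projecting the hypothesized refusal direction out of the model's hidden states and checking whether refusals disappear. Below is a minimal PyTorch sketch of that operation; the tensor shapes and the difference-of-means extraction mentioned in the comments are illustrative assumptions, not this paper's exact setup.

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `hidden` along `direction` (directional ablation).

    Under the single-direction hypothesis, zeroing this projection
    should suppress refusals across prompts.
    """
    r_hat = direction / direction.norm()            # unit refusal direction
    proj = (hidden @ r_hat).unsqueeze(-1) * r_hat   # component along r_hat
    return hidden - proj

# Toy usage with random stand-ins for real model activations.
hidden = torch.randn(4, 4096)     # (batch, d_model); 4096 is an assumed size
refusal_dir = torch.randn(4096)   # e.g. a difference-of-means direction
edited = ablate_direction(hidden, refusal_dir)
print(edited @ (refusal_dir / refusal_dir.norm()))  # ~0: component removed
```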
- LLMs contain multiple refusal directions rather than just one
- Different refusal concepts (violence, sexuality, etc.) are representationally independent
- The research introduces a novel gradient-based approach to representation engineering (see the sketch after this list)
- These findings help explain why some adversarial attacks bypass safety guardrails so easily: with an entire cone of refusal-mediating directions, attackers have many independent ways to steer the model away from refusing
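
The paper's actual gradient-based procedure is not reproduced here. The sketch below only illustrates the general idea under loud assumptions: `refusal_score` is a hypothetical differentiable stand-in for running the model with a candidate direction injected into the residual stream and scoring its refusal behavior, and the orthogonality penalty is one simple way to search for several distinct directions.

```python
import torch

def find_refusal_directions(refusal_score, d_model, n_dirs=3,
                            steps=200, lr=1e-2, ortho_weight=10.0):
    """Gradient-based search for several distinct refusal directions.

    `refusal_score(d_hat)` is a hypothetical differentiable scalar; in a
    real setup it would add `d_hat` to the residual stream and measure
    refusal (e.g. the logit of a refusal token). An orthogonality penalty
    pushes each new direction away from those already found.
    """
    found = []
    for _ in range(n_dirs):
        d = torch.randn(d_model, requires_grad=True)
        opt = torch.optim.Adam([d], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            d_hat = d / d.norm()
            loss = -refusal_score(d_hat)
            for prev in found:  # penalize overlap with earlier directions
                loss = loss + ortho_weight * (d_hat @ prev) ** 2
            loss.backward()
            opt.step()
        found.append(d.detach() / d.detach().norm())
    return found

# Toy stand-in: "refusal" is alignment with three hidden orthonormal dirs.
torch.manual_seed(0)
true_dirs = torch.linalg.qr(torch.randn(64, 3)).Q.T   # (3, 64)
score = lambda d: (true_dirs @ d).pow(2).sum()
dirs = find_refusal_directions(score, d_model=64)
print(torch.stack(dirs) @ torch.stack(dirs).T)        # off-diagonal ~ 0
```

Note that orthogonality is only a crude geometric proxy in this toy: the representational independence studied in the research is a property of how the model behaves when each direction is manipulated, not merely of the angles between directions.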
For security teams, this work clarifies how safety mechanisms actually function inside LLMs, potentially enabling more robust defenses against adversarial attacks that attempt to circumvent content restrictions.