
Making LLMs Explain Themselves
Enhancing model explainability without external modules
SEER is a novel framework that enhances large language models' self-explainability by reorganizing their internal representations, improving transparency and reliability.
- Creates concept-level representations by clustering semantically similar tokens
- Disentangles different concepts to reduce confusion between similar inputs (see the sketch after this list)
- Demonstrates improved performance on safety classification and detoxification tasks
- Provides more accurate and reliable explanations than methods that rely on external explainer modules
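
The bullets above describe two representation-shaping objectives: pulling together representations that express the same concept (aggregation) and pushing apart representations of different concepts (disentanglement). The sketch below is a minimal, illustrative version of such losses, not SEER's actual implementation; the function name `concept_losses`, the `margin` parameter, and the assumption that each representation comes with a concept label are all hypothetical.

```python
import torch
import torch.nn.functional as F

def concept_losses(hidden: torch.Tensor, concept_ids: torch.Tensor, margin: float = 0.2):
    """Illustrative aggregation/disentanglement losses (not SEER's actual objective).

    hidden:      (N, d) representations, e.g. hidden states of N tokens or samples
    concept_ids: (N,) integer labels assigning each representation to a concept
    """
    h = F.normalize(hidden, dim=-1)          # compare representations by cosine similarity
    sim = h @ h.T                            # (N, N) pairwise similarity matrix
    same = concept_ids.unsqueeze(0) == concept_ids.unsqueeze(1)
    eye = torch.eye(len(h), dtype=torch.bool, device=h.device)

    # Aggregation: representations of the same concept should be close,
    # so penalize low similarity within a concept.
    pos = same & ~eye
    agg_loss = (1.0 - sim[pos]).mean() if pos.any() else sim.new_zeros(())

    # Disentanglement: representations of different concepts should be separated,
    # so penalize similarity above a margin across concepts.
    neg = ~same
    dis_loss = F.relu(sim[neg] - margin).mean() if neg.any() else sim.new_zeros(())

    return agg_loss, dis_loss

# Toy usage: 6 random representations assigned to 3 concepts.
reps = torch.randn(6, 768)
labels = torch.tensor([0, 0, 1, 1, 2, 2])
agg, dis = concept_losses(reps, labels)
print(agg.item(), dis.item())
```

In training, auxiliary terms like these would be weighted and added to the task loss; how SEER actually identifies concepts and combines its objectives is detailed in the paper.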
This research is particularly valuable for security applications: improved transparency gives a clearer view of model decisions in high-stakes scenarios and reduces potential safety risks.
SEER: Self-Explainability Enhancement of Large Language Models' Representations