
Making LLMs Explain Themselves
Enhancing model explainability without external modules
SEER is a novel framework that enhances large language models' self-explainability by reorganizing their internal representations, improving transparency and reliability.
- Creates concept-level representations by clustering semantically similar tokens
- Disentangles different concepts to reduce confusion between similar inputs (see the sketch after this list)
- Demonstrates improved performance on safety classification and detoxification tasks
- Provides more accurate and reliable explanations than methods that rely on external explainer modules
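
The bullets above describe two representation-shaping objectives: pulling together representations that express the same concept (aggregation) and pushing apart representations of different concepts (disentanglement). The sketch below is a minimal, illustrative version of such losses, not SEER's actual implementation; the function name `concept_losses`, the `margin` parameter, and the assumption that each representation comes with a concept label are all hypothetical.

```python
import torch
import torch.nn.functional as F

def concept_losses(hidden: torch.Tensor, concept_ids: torch.Tensor, margin: float = 0.2):
    """Illustrative aggregation/disentanglement losses (not SEER's actual objective).

    hidden:      (N, d) representations, e.g. hidden states of N tokens or samples
    concept_ids: (N,) integer labels assigning each representation to a concept
    """
    h = F.normalize(hidden, dim=-1)          # compare representations by cosine similarity
    sim = h @ h.T                            # (N, N) pairwise similarity matrix
    same = concept_ids.unsqueeze(0) == concept_ids.unsqueeze(1)
    eye = torch.eye(len(h), dtype=torch.bool, device=h.device)

    # Aggregation: representations of the same concept should be close,
    # so penalize low similarity within a concept.
    pos = same & ~eye
    agg_loss = (1.0 - sim[pos]).mean() if pos.any() else sim.new_zeros(())

    # Disentanglement: representations of different concepts should be separated,
    # so penalize similarity above a margin across concepts.
    neg = ~same
    dis_loss = F.relu(sim[neg] - margin).mean() if neg.any() else sim.new_zeros(())

    return agg_loss, dis_loss

# Toy usage: 6 random representations assigned to 3 concepts.
reps = torch.randn(6, 768)
labels = torch.tensor([0, 0, 1, 1, 2, 2])
agg, dis = concept_losses(reps, labels)
print(agg.item(), dis.item())
```

In training, auxiliary terms like these would be weighted and added to the task loss; how SEER actually identifies concepts and combines its objectives is detailed in the paper.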
This research is particularly valuable for security applications: improved transparency gives a clearer view of model decisions in high-stakes scenarios and reduces potential safety risks.
SEER: Self-Explainability Enhancement of Large Language Models' Representations