Making LLMs Explain Themselves

Enhancing model explainability without external modules

SEER is a novel framework that enhances large language models' self-explainability by reorganizing their internal representations, improving transparency and reliability.

  • Creates concept-level representations by clustering and aggregating similar tokens (see the sketch after this list)
  • Disentangles different concepts to reduce confusion between similar inputs
  • Demonstrates improved performance on safety classification and detoxification tasks
  • Provides more accurate and reliable explanations compared to external explainer methods
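As a rough illustration of the first two bullets, the sketch below clusters token hidden states into concept-level vectors and scores how entangled those concepts are. The clustering method (k-means), mean pooling, and the cosine-overlap penalty are illustrative assumptions for this sketch, not SEER's exact procedure or loss.

```python
# Minimal sketch: concept-level aggregation of token representations
# plus a simple disentanglement score. Assumptions: k-means clustering,
# mean pooling, and cosine overlap as a surrogate penalty (not the
# paper's exact method).
import numpy as np
from sklearn.cluster import KMeans


def aggregate_concepts(token_reps: np.ndarray, n_concepts: int = 8):
    """Cluster token hidden states and mean-pool each cluster into a
    concept-level vector.

    token_reps: (num_tokens, hidden_dim) hidden states for one input.
    Returns (n_concepts, hidden_dim) concept vectors and the cluster labels.
    """
    labels = KMeans(n_clusters=n_concepts, n_init=10).fit_predict(token_reps)
    concepts = np.stack(
        [token_reps[labels == c].mean(axis=0) for c in range(n_concepts)]
    )
    return concepts, labels


def disentanglement_penalty(concepts: np.ndarray) -> float:
    """Average off-diagonal cosine similarity between concept vectors:
    lower means the concepts overlap less (are better disentangled)."""
    normed = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    sim = normed @ normed.T
    off_diag = sim - np.eye(len(concepts))
    return float(np.abs(off_diag).mean())


# Toy usage with random stand-ins for hidden states.
reps = np.random.randn(64, 768).astype(np.float32)
concepts, labels = aggregate_concepts(reps, n_concepts=8)
print(concepts.shape, disentanglement_penalty(concepts))
```

In this reading, the concept vectors serve as the units the model's explanations refer to, and keeping them well separated is what reduces confusion between similar inputs.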

This research is particularly valuable for security applications, enabling better understanding of model decisions in high-stakes scenarios and reducing potential safety risks through improved transparency.

SEER: Self-Explainability Enhancement of Large Language Models' Representations
