
Making LLMs Interpretable and Steerable
Using Mutual Information to Explain Sparse Autoencoder Features
This research advances our ability to understand and control large language model (LLM) behavior by analyzing internal representations with sparse autoencoders (SAEs).
- Proposes a mutual information (MI)-based approach to explaining sparse autoencoder features that capture meaningful language patterns
- Develops techniques to steer LLM behavior by manipulating the identified features (see the sketch after this list)
- Demonstrates effectiveness in defending against jailbreak attacks and improving output quality
- Creates an interactive tool for exploring and manipulating model internal states
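To make the two core ideas concrete, here is a minimal, hedged sketch in Python. It is not the paper's exact formulation: it assumes a trained SAE with illustrative weights (`W_enc`, `W_dec`, `b_enc`, `b_dec`) and a toy corpus, scores one candidate explanation for a feature by the mutual information between the feature firing and a concept being present, and then steers by scaling that feature's activation and writing the resulting change back into the residual-stream activations. All names, shapes, and the `scale` parameter are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): (1) score a candidate explanation
# for one SAE feature via mutual information, (2) steer by scaling that feature.
# The SAE weights and corpus below are toy stand-ins, purely for illustration.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 64, 256, 5000

# Toy stand-ins for a trained SAE and a corpus of residual-stream activations.
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)
resid = rng.normal(size=(n_tokens, d_model))            # residual-stream vectors
token_has_concept = rng.integers(0, 2, size=n_tokens)   # 1 if a candidate concept
                                                         # (e.g. a word) is present

def sae_encode(x):
    """ReLU encoder: sparse feature activations for each token."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f):
    """Linear decoder back to the residual stream."""
    return f @ W_dec + b_dec

# --- 1) Mutual-information score for a candidate explanation of feature j ---
feature_id = 17
acts = sae_encode(resid)[:, feature_id]
fires = (acts > 0).astype(int)                   # binarize: does the feature fire?
mi = mutual_info_score(fires, token_has_concept)
print(f"MI(feature {feature_id} fires; concept present) = {mi:.4f}")
# Higher MI -> the candidate concept is a better explanation of this feature.

# --- 2) Steering: amplify one feature and write the edit back ---
def steer(x, feature_id, scale=5.0):
    """Scale one SAE feature and add the implied change to the activations."""
    f = sae_encode(x)
    f_edited = f.copy()
    f_edited[:, feature_id] *= scale
    delta = sae_decode(f_edited) - sae_decode(f)  # change implied by the edit
    return x + delta                              # steered residual activations

steered = steer(resid[:8], feature_id, scale=5.0)
print("steered batch shape:", steered.shape)
```

In practice the residual-stream activations would come from a hooked forward pass of the LLM, and the steered activations would replace the originals at that layer before generation continues; the toy arrays here only illustrate the shape of the computation.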
This work matters for security professionals because it provides mechanisms to identify and suppress potentially harmful outputs before they are generated, enhancing LLM safety without retraining the model.
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders