
Making LLMs Interpretable and Steerable
Using Mutual Information to Explain Sparse Autoencoder Features
This research advances our ability to understand and control large language model (LLM) behavior by analyzing internal representations with sparse autoencoders (SAEs).
- Proposes a mutual information (MI)-based approach to explaining sparse autoencoder features that capture meaningful language patterns
- Develops techniques to steer LLM behavior by manipulating the identified features (see the sketch after this list)
- Demonstrates effectiveness in defending against jailbreak attacks and improving output quality
- Creates an interactive tool for exploring and manipulating model internal states
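To make the two core ideas concrete, here is a minimal, hedged sketch in Python. It is not the paper's exact formulation: it assumes a trained SAE with illustrative weights (`W_enc`, `W_dec`, `b_enc`, `b_dec`) and a toy corpus, scores one candidate explanation for a feature by the mutual information between the feature firing and a concept being present, and then steers by scaling that feature's activation and writing the resulting change back into the residual-stream activations. All names, shapes, and the `scale` parameter are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): (1) score a candidate explanation
# for one SAE feature via mutual information, (2) steer by scaling that feature.
# The SAE weights and corpus below are toy stand-ins, purely for illustration.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 64, 256, 5000

# Toy stand-ins for a trained SAE and a corpus of residual-stream activations.
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)
resid = rng.normal(size=(n_tokens, d_model))            # residual-stream vectors
token_has_concept = rng.integers(0, 2, size=n_tokens)   # 1 if a candidate concept
                                                         # (e.g. a word) is present

def sae_encode(x):
    """ReLU encoder: sparse feature activations for each token."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f):
    """Linear decoder back to the residual stream."""
    return f @ W_dec + b_dec

# --- 1) Mutual-information score for a candidate explanation of feature j ---
feature_id = 17
acts = sae_encode(resid)[:, feature_id]
fires = (acts > 0).astype(int)                   # binarize: does the feature fire?
mi = mutual_info_score(fires, token_has_concept)
print(f"MI(feature {feature_id} fires; concept present) = {mi:.4f}")
# Higher MI -> the candidate concept is a better explanation of this feature.

# --- 2) Steering: amplify one feature and write the edit back ---
def steer(x, feature_id, scale=5.0):
    """Scale one SAE feature and add the implied change to the activations."""
    f = sae_encode(x)
    f_edited = f.copy()
    f_edited[:, feature_id] *= scale
    delta = sae_decode(f_edited) - sae_decode(f)  # change implied by the edit
    return x + delta                              # steered residual activations

steered = steer(resid[:8], feature_id, scale=5.0)
print("steered batch shape:", steered.shape)
```

In practice the residual-stream activations would come from a hooked forward pass of the LLM, and the steered activations would replace the originals at that layer before generation continues; the toy arrays here only illustrate the shape of the computation.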
This work matters for security professionals because it provides mechanisms to identify and suppress potentially harmful outputs before they are generated, enhancing LLM safety without retraining the model.
Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders