
Unlocking Transparency in LLMs
Using Sparse Autoencoders for Interpretable Feature Extraction
This research demonstrates how Sparse Autoencoders (SAEs) can extract interpretable features from large language models, enabling more transparent AI applications. Key contributions:
- Systematic framework for optimizing SAE configurations, including model and layer selection and architectural design choices
- Enhanced interpretability for safety-critical classification tasks in medical and security domains (see the sketch after this list)
- Transferable features that work across different model architectures, improving deployment flexibility
- Scalable approach that adapts to various model sizes and complexities
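The feature-extraction-plus-probe pipeline behind these points can be summarized in a few lines. The sketch below is illustrative only: the `SparseAutoencoder` class, its dimensions, and the random weights and data are assumptions standing in for a trained SAE and real LLM activations, since the paper's exact architecture and training setup are not reproduced here. It shows the core idea: encode hidden states into sparse features, then fit a linear probe whose nonzero coefficients can be read against named features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SparseAutoencoder:
    """Minimal ReLU SAE encoder: f(x) = ReLU((x - b_dec) @ W_enc.T + b_enc)."""

    def __init__(self, d_model: int, d_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Placeholder weights; a real SAE is trained to reconstruct x with an
        # L1 sparsity penalty on the feature activations.
        self.W_enc = rng.normal(0.0, 0.02, (d_features, d_model))
        self.b_enc = np.zeros(d_features)
        self.b_dec = np.zeros(d_model)

    def encode(self, x: np.ndarray) -> np.ndarray:
        """Map LLM activations (n, d_model) to sparse features (n, d_features)."""
        return np.maximum((x - self.b_dec) @ self.W_enc.T + self.b_enc, 0.0)

# Hypothetical data: pooled activations from one LLM layer, one row per text,
# with binary labels (e.g. benign vs. unsafe content).
d_model, d_features, n_texts = 768, 4096, 200
rng = np.random.default_rng(1)
acts = rng.normal(size=(n_texts, d_model))   # stand-in for real activations
labels = rng.integers(0, 2, size=n_texts)

sae = SparseAutoencoder(d_model, d_features)
features = sae.encode(acts)                  # sparse, interpretable features

# A linear probe on the SAE features serves as the classifier; its nonzero
# coefficients identify which features drive each prediction, which is what
# makes the pipeline auditable in safety-critical settings.
probe = LogisticRegression(max_iter=1000, C=0.1).fit(features, labels)
print(f"train accuracy: {probe.score(features, labels):.2f}")
```

Under the transferability point above, the same probe could in principle be reused on features from an SAE attached to a different model, provided the features are matched or re-learned across models; that step is not shown here.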
This matters for medical applications because it provides the explainability and transparency needed for clinical adoption, regulatory compliance, and building trust in AI-based diagnostic and decision-support tools.
Paper: Sparse Autoencoder Features for Classifications and Transferability