Interpretability for LLM Security

Research on understanding and explaining the internal states and mechanisms of LLMs to improve security, detect vulnerabilities, and enable safer steering of model behavior.