Interpretability for LLM Security

Research on understanding and explaining LLM internal states and mechanisms to improve security, detect vulnerabilities, and enable safer steering of model behavior

This presentation covers 12 research papers on interpretability techniques for improving the security of large language models.
