
Selective Forgetting in Language Models
A novel approach to removing private information from LLMs
LUNAR introduces a methodology for selectively removing targeted knowledge from Large Language Models (LLMs) without degrading overall performance.
- Neural Activation Redirection steers the internal representations of unwanted knowledge away from the model's retrieval mechanisms (a minimal sketch follows this list)
- Leverages the Linear Representation Hypothesis to efficiently target specific information for removal
- Demonstrates robust protection against white-box adversarial attacks
- Maintains model performance on general tasks while effectively removing targeted information
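For intuition, here is a minimal sketch of what an activation-redirection step can look like under the Linear Representation Hypothesis, where a concept corresponds to a direction in activation space. This is not presented as LUNAR's actual implementation: the function name, the direction vectors, and the `alpha` scale are illustrative assumptions.

```python
import torch

def redirect_activations(hidden, forget_dir, target_dir, alpha=1.0):
    """Remove the component of each hidden state that lies along the 'forget'
    concept direction and re-route that magnitude toward a target direction
    (e.g., a region where the model declines to answer).

    hidden:     (batch, seq_len, d_model) activations from one transformer layer
    forget_dir: (d_model,) direction encoding the knowledge to unlearn
    target_dir: (d_model,) direction the activations are steered toward
    """
    forget_dir = forget_dir / forget_dir.norm()
    target_dir = target_dir / target_dir.norm()
    coeff = hidden @ forget_dir                               # projection strength, shape (batch, seq_len)
    hidden = hidden - coeff.unsqueeze(-1) * forget_dir        # erase the unwanted component
    return hidden + alpha * coeff.unsqueeze(-1) * target_dir  # redirect it instead

# Toy usage on random tensors; in practice the direction vectors would be
# estimated from the model's activations on "forget" vs. "retain" prompts.
h = torch.randn(2, 8, 4096)
v_forget, v_target = torch.randn(4096), torch.randn(4096)
h_redirected = redirect_activations(h, v_forget, v_target)
```

In a full model, a transformation like this would be applied to the activations of selected layers during the forward pass (for example, via a forward hook), so that prompts touching the forgotten topic are routed into the redirected region while unrelated activations are left largely untouched.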
Why it matters for security: As LLMs train on increasingly vast datasets, they risk memorizing and potentially leaking sensitive information. LUNAR provides a practical solution for maintaining privacy compliance while preserving model utility.