
Selective Forgetting in Language Models
A novel approach to removing private information from LLMs
LUNAR introduces a methodology for selectively removing targeted knowledge from Large Language Models (LLMs) without degrading overall performance.
- Neural Activation Redirection steers the internal representations of unwanted knowledge away from the model's retrieval mechanisms (a minimal sketch follows this list)
- Leverages the Linear Representation Hypothesis to efficiently target specific information for removal
- Demonstrates robust protection against white-box adversarial attacks
- Maintains model performance on general tasks while effectively removing targeted information
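For intuition, here is a minimal sketch of what an activation-redirection step can look like under the Linear Representation Hypothesis, where a concept corresponds to a direction in activation space. This is not presented as LUNAR's actual implementation: the function name, the direction vectors, and the `alpha` scale are illustrative assumptions.

```python
import torch

def redirect_activations(hidden, forget_dir, target_dir, alpha=1.0):
    """Remove the component of each hidden state that lies along the 'forget'
    concept direction and re-route that magnitude toward a target direction
    (e.g., a region where the model declines to answer).

    hidden:     (batch, seq_len, d_model) activations from one transformer layer
    forget_dir: (d_model,) direction encoding the knowledge to unlearn
    target_dir: (d_model,) direction the activations are steered toward
    """
    forget_dir = forget_dir / forget_dir.norm()
    target_dir = target_dir / target_dir.norm()
    coeff = hidden @ forget_dir                               # projection strength, shape (batch, seq_len)
    hidden = hidden - coeff.unsqueeze(-1) * forget_dir        # erase the unwanted component
    return hidden + alpha * coeff.unsqueeze(-1) * target_dir  # redirect it instead

# Toy usage on random tensors; in practice the direction vectors would be
# estimated from the model's activations on "forget" vs. "retain" prompts.
h = torch.randn(2, 8, 4096)
v_forget, v_target = torch.randn(4096), torch.randn(4096)
h_redirected = redirect_activations(h, v_forget, v_target)
```

In a full model, a transformation like this would be applied to the activations of selected layers during the forward pass (for example, via a forward hook), so that prompts touching the forgotten topic are routed into the redirected region while unrelated activations are left largely untouched.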
Why it matters for security: As LLMs train on increasingly vast datasets, they risk memorizing and potentially leaking sensitive information. LUNAR provides a practical solution for maintaining privacy compliance while preserving model utility.