
Simplifying LLM Unlearning
A more effective approach to removing unwanted content from AI models
This research rethinks Negative Preference Optimization (NPO) and proposes a simpler, more effective framework for removing unwanted content from large language models while preserving their performance.
- Demonstrates that complete gradient reversal is unnecessary and often counterproductive for unlearning
- Proposes a more controlled optimization approach that maintains model utility while effectively removing harmful content (see the sketch after this list)
- Achieves better resistance against relearning attacks than traditional gradient ascent methods
- Provides a mathematically grounded framework whose simpler objective outperforms more complex alternatives
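To make the contrast in these bullets concrete, below is a minimal PyTorch-style sketch of the two forget objectives being compared: plain gradient ascent (full sign reversal of the training loss) versus a bounded NPO-style loss. The function names, the sequence-level log-likelihood inputs, and the value β = 0.1 are illustrative assumptions; this shows the general form of the objective discussed in this line of work, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ga_forget_loss(logp_theta: torch.Tensor) -> torch.Tensor:
    # Plain gradient ascent on forget data: minimize +log p, i.e. flip the
    # sign of the usual training loss. Unbounded below, which is why it is
    # described above as often counterproductive.
    return logp_theta.mean()

def npo_forget_loss(logp_theta: torch.Tensor,
                    logp_ref: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    # NPO-style bounded forget loss:
    #   (2 / beta) * log(1 + (pi_theta / pi_ref) ** beta)
    # Its gradient shrinks as the model's probability on forget data drops,
    # giving the "controlled" behaviour the summary describes.
    log_ratio = logp_theta - logp_ref  # sequence-level log(pi_theta / pi_ref)
    return (2.0 / beta) * F.softplus(beta * log_ratio).mean()

# Illustrative usage: per-example log-likelihoods of forget responses under
# the current model and a frozen reference model (values are made up).
logp_theta = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
logp_ref = torch.tensor([-11.9, -9.0, -14.8])
loss = npo_forget_loss(logp_theta, logp_ref)
loss.backward()
```

In practice a forget loss like this is typically paired with a retain-set term to preserve general utility; the key property illustrated here is simply that the NPO-style objective stays bounded while gradient ascent does not.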
For security teams, this research offers a practical path to making AI systems safer by efficiently removing harmful or prohibited content without sacrificing model performance.
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning