The Coreset Effect in LLM Unlearning

Why current unlearning benchmarks may be easier than they appear

This research reveals that popular LLM unlearning benchmarks (WMDP and MUSE) exhibit a strong coreset effect: unlearning on only a small, even randomly selected, subset of the designated forget set can be as effective as unlearning on the full set.
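
To make the setup concrete, here is a minimal Python sketch of how such a coreset is formed: a small random subsample of the forget set is drawn and handed to the same unlearning procedure. The `unlearn` and `evaluate` names are hypothetical placeholders rather than code from the paper, and the 5% default ratio is illustrative.

```python
import random

def sample_coreset(forget_set, ratio=0.05, seed=0):
    """Draw a random 'coreset' from the forget set.

    The reported coreset effect: running the same unlearning method
    on a subset this small can match unlearning on the full set.
    """
    rng = random.Random(seed)
    k = max(1, int(len(forget_set) * ratio))
    return rng.sample(forget_set, k)

# Hypothetical usage; `unlearn` and `evaluate` are placeholders for an
# unlearning method (e.g., NPO or RMU) and a benchmark metric (e.g.,
# WMDP accuracy), not an actual API from the paper:
#
#   model_full    = unlearn(base_model, forget_set)
#   model_coreset = unlearn(base_model, sample_coreset(forget_set))
#   # Coreset effect: evaluate(model_coreset) ~= evaluate(model_full)
```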

Key Findings:

  • Current benchmarks may not adequately test true unlearning capabilities
  • Even small, randomly selected coresets of the forget set can match full-forget-set unlearning on existing benchmarks
  • More robust benchmarks are needed to properly evaluate unlearning techniques
  • Findings have implications for AI safety and controlled model behavior

Why It Matters: As LLMs become more widespread, ensuring safe and controllable model behavior is critical. This research highlights potential weaknesses in how unlearning methods are evaluated, methods that are essential for removing harmful knowledge or capabilities from deployed models.

Paper: LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks
