The Coreset Effect in LLM Unlearning

Why current unlearning benchmarks may be easier than they appear

This research reveals that popular LLM unlearning benchmarks (WMDP and MUSE) exhibit a strong coreset effect: unlearning on only a small, even randomly selected, subset of the designated forget set can be as effective as unlearning on the full set.
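
To make the setup concrete, here is a minimal Python sketch of how such a coreset is formed: a small random subsample of the forget set is drawn and handed to the same unlearning procedure. The `unlearn` and `evaluate` names are hypothetical placeholders rather than code from the paper, and the 5% default ratio is illustrative.

```python
import random

def sample_coreset(forget_set, ratio=0.05, seed=0):
    """Draw a random 'coreset' from the forget set.

    The reported coreset effect: running the same unlearning method
    on a subset this small can match unlearning on the full set.
    """
    rng = random.Random(seed)
    k = max(1, int(len(forget_set) * ratio))
    return rng.sample(forget_set, k)

# Hypothetical usage; `unlearn` and `evaluate` are placeholders for an
# unlearning method (e.g., NPO or RMU) and a benchmark metric (e.g.,
# WMDP accuracy), not an actual API from the paper:
#
#   model_full    = unlearn(base_model, forget_set)
#   model_coreset = unlearn(base_model, sample_coreset(forget_set))
#   # Coreset effect: evaluate(model_coreset) ~= evaluate(model_full)
```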

Key Findings:

  • Current benchmarks may not adequately test true unlearning capabilities
  • Even small, randomly selected coresets of the forget set can match full-forget-set unlearning on existing benchmarks
  • More robust benchmarks are needed to properly evaluate unlearning techniques
  • Findings have implications for AI safety and controlled model behavior

Why It Matters: As LLMs become more widespread, ensuring safe and controllable model behavior is critical. This research highlights potential weaknesses in how unlearning methods are evaluated, methods that are essential for removing harmful knowledge or capabilities from deployed models.

Paper: LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks
