Defending LLMs Against Harmful Fine-tuning

Simple random perturbations outperform complex defenses

Panacea introduces a surprisingly effective defense against harmful fine-tuning attacks by applying random perturbations to model weights after fine-tuning.

  • Conventional defenses that try to "vaccinate" models are fragile and can be overcome with additional fine-tuning steps
  • Adding random noise to model parameters after fine-tuning effectively neutralizes harmful behaviors (see the sketch after this list)
  • The approach is computationally efficient and requires no additional training data
  • Extensive experiments demonstrate the method works across multiple model architectures and harmful tasks
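In PyTorch terms, the post-fine-tuning perturbation described above amounts to adding noise to every weight of the fine-tuned model before it is served. The snippet below is an illustrative sketch only, not the paper's released implementation; the model path, noise distribution, and scale are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the fine-tuned model (placeholder path).
model = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model")

# Assumed noise magnitude; in practice this would be tuned so harmful
# behavior is disrupted without destroying fine-tuning performance.
noise_scale = 1e-2

with torch.no_grad():
    for param in model.parameters():
        # Add zero-mean Gaussian noise to each weight after fine-tuning.
        param.add_(torch.randn_like(param) * noise_scale)

model.save_pretrained("path/to/perturbed-model")
```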

This research matters for security: it offers a practical, resource-efficient way to keep public fine-tuning services from being exploited to produce models with harmful behaviors.

Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation