
Defending LLMs Against Harmful Fine-tuning
A simple post-fine-tuning perturbation outperforms alignment-stage defenses
Panacea introduces a surprisingly effective defense against harmful fine-tuning attacks: it perturbs the model's weights after fine-tuning, building on the observation that even purely random noise can restore safety alignment.
- Conventional defenses that try to "vaccinate" the model before fine-tuning are fragile and can be overcome with a few additional fine-tuning steps
- Adding purely random noise to the fine-tuned model's weights recovers safety alignment, though it also degrades downstream fine-tuning performance (a minimal sketch of this baseline follows the list)
- Panacea therefore optimizes an adaptive perturbation that is applied after fine-tuning, keeping the model safe without sacrificing task performance; the approach is computationally efficient and requires no additional training data
- Extensive experiments demonstrate the method works across multiple model architectures and harmful tasks
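As an illustration of the random-perturbation baseline mentioned above, here is a minimal sketch of adding Gaussian noise to a fine-tuned model's weights. The noise scale `sigma` and the checkpoint paths are placeholders chosen for illustration, not values from the paper, and the full Panacea method optimizes an adaptive perturbation rather than sampling purely random noise.

```python
# Minimal sketch (not the paper's exact procedure): add i.i.d. Gaussian
# noise to every parameter of a fine-tuned model. `sigma` is a
# hypothetical hyperparameter; Panacea itself optimizes an adaptive
# perturbation instead of using purely random noise.
import torch
from transformers import AutoModelForCausalLM

def perturb_weights(model: torch.nn.Module, sigma: float = 0.01) -> None:
    """Add random Gaussian noise to all parameters in place."""
    with torch.no_grad():
        for param in model.parameters():
            param.add_(torch.randn_like(param) * sigma)

# Hypothetical usage with placeholder checkpoint paths.
model = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model")
perturb_weights(model, sigma=0.01)
model.save_pretrained("path/to/perturbed-model")
```

In this sketch the noise is isotropic across all weights; the trade-off the paper highlights is that such indiscriminate perturbation can erase harmful behavior but also erodes the fine-tuned task performance, which motivates optimizing the perturbation instead.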
This research matters for security by providing a practical, resource-efficient way to protect public fine-tuning services from being weaponized for harmful applications.
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation