
Defending LLMs Against Harmful Fine-tuning
A simple post-fine-tuning perturbation outperforms alignment-stage defenses
Panacea introduces a surprisingly effective defense against harmful fine-tuning attacks: it perturbs the model's weights after fine-tuning, building on the observation that even purely random noise can restore safety alignment.
- Conventional defenses that try to "vaccinate" the model before fine-tuning are fragile and can be overcome with a few additional fine-tuning steps
- Adding purely random noise to the fine-tuned model's weights recovers safety alignment, though it also degrades downstream fine-tuning performance (a minimal sketch of this baseline follows the list)
- Panacea therefore optimizes an adaptive perturbation that is applied after fine-tuning, keeping the model safe without sacrificing task performance; the approach is computationally efficient and requires no additional training data
- Extensive experiments demonstrate the method works across multiple model architectures and harmful tasks
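As an illustration of the random-perturbation baseline mentioned above, here is a minimal sketch of adding Gaussian noise to a fine-tuned model's weights. The noise scale `sigma` and the checkpoint paths are placeholders chosen for illustration, not values from the paper, and the full Panacea method optimizes an adaptive perturbation rather than sampling purely random noise.

```python
# Minimal sketch (not the paper's exact procedure): add i.i.d. Gaussian
# noise to every parameter of a fine-tuned model. `sigma` is a
# hypothetical hyperparameter; Panacea itself optimizes an adaptive
# perturbation instead of using purely random noise.
import torch
from transformers import AutoModelForCausalLM

def perturb_weights(model: torch.nn.Module, sigma: float = 0.01) -> None:
    """Add random Gaussian noise to all parameters in place."""
    with torch.no_grad():
        for param in model.parameters():
            param.add_(torch.randn_like(param) * sigma)

# Hypothetical usage with placeholder checkpoint paths.
model = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-model")
perturb_weights(model, sigma=0.01)
model.save_pretrained("path/to/perturbed-model")
```

In this sketch the noise is isotropic across all weights; the trade-off the paper highlights is that such indiscriminate perturbation can erase harmful behavior but also erodes the fine-tuned task performance, which motivates optimizing the perturbation instead.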
This research matters for security by providing a practical, resource-efficient way to protect public fine-tuning services from being weaponized for harmful applications.
Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation