Restoring Safety in Fine-Tuned LLMs

A post-hoc approach to recover safety alignment after fine-tuning

This research introduces IRR (Identify, Remove, and Recalibrate), a post-hoc method for restoring safety alignment in language models that have become unsafe during fine-tuning.

  • Identifies and surgically removes unsafe parameter changes introduced during fine-tuning (see the sketch following this list)
  • Maintains model performance on intended tasks while restoring safety guardrails
  • Demonstrates effectiveness against harmful queries and jailbreak attacks
  • Provides a practical solution for deploying fine-tuned models without compromising safety
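
The first bullet can be made concrete with a rough sketch of an identify/remove/recalibrate pass over parameter deltas. This is not the paper's exact algorithm: the safety_importance scores, the keep_ratio threshold, and the norm-based recalibration step are illustrative assumptions, included only to show how unsafe fine-tuning updates could be identified, reverted, and the remaining updates rescaled.

    import torch

    def irr_sketch(base_params, ft_params, safety_importance, keep_ratio=0.9):
        """Hypothetical identify-remove-recalibrate pass (not the paper's exact method).

        base_params / ft_params: dicts of parameter tensors for the safety-aligned
        base model and the fine-tuned model. safety_importance: per-parameter scores
        (same shapes) estimating how much each delta harms safety.
        """
        repaired = {}
        for name, w_ft in ft_params.items():
            w_base = base_params[name]
            delta = w_ft - w_base                      # update introduced by fine-tuning
            score = safety_importance[name]

            # Identify: flag the most safety-harmful deltas (top (1 - keep_ratio) by score).
            threshold = torch.quantile(score.flatten(), keep_ratio)
            unsafe_mask = score > threshold

            # Remove: revert the flagged deltas back to the safety-aligned base weights.
            pruned_delta = torch.where(unsafe_mask, torch.zeros_like(delta), delta)

            # Recalibrate: rescale the retained deltas so the overall update magnitude
            # stays close to that of the original fine-tuning update.
            scale = delta.norm() / (pruned_delta.norm() + 1e-8)
            repaired[name] = w_base + pruned_delta * scale
        return repaired

In practice the repaired state dict would be loaded back into the fine-tuned model, preserving task performance from the retained deltas while reverting the parameters judged most responsible for unsafe behavior.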

This work addresses a critical security challenge for organizations using customized LLMs, offering a way to benefit from domain-specific fine-tuning without introducing new vulnerabilities or harmful behaviors.

Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models
