Restoring Safety in Fine-Tuned LLMs

A post-hoc approach to recover safety alignment after fine-tuning

This research introduces IRR (Identify, Remove, and Recalibrate), a post-hoc method for restoring safety alignment in language models that have become unsafe during fine-tuning.

  • Identifies and surgically removes unsafe parameter changes introduced during fine-tuning (see the sketch following this list)
  • Maintains model performance on intended tasks while restoring safety guardrails
  • Demonstrates effectiveness against harmful queries and jailbreak attacks
  • Provides a practical solution for deploying fine-tuned models without compromising safety
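
The first bullet can be made concrete with a rough sketch of an identify/remove/recalibrate pass over parameter deltas. This is not the paper's exact algorithm: the safety_importance scores, the keep_ratio threshold, and the norm-based recalibration step are illustrative assumptions, included only to show how unsafe fine-tuning updates could be identified, reverted, and the remaining updates rescaled.

    import torch

    def irr_sketch(base_params, ft_params, safety_importance, keep_ratio=0.9):
        """Hypothetical identify-remove-recalibrate pass (not the paper's exact method).

        base_params / ft_params: dicts of parameter tensors for the safety-aligned
        base model and the fine-tuned model. safety_importance: per-parameter scores
        (same shapes) estimating how much each delta harms safety.
        """
        repaired = {}
        for name, w_ft in ft_params.items():
            w_base = base_params[name]
            delta = w_ft - w_base                      # update introduced by fine-tuning
            score = safety_importance[name]

            # Identify: flag the most safety-harmful deltas (top (1 - keep_ratio) by score).
            threshold = torch.quantile(score.flatten(), keep_ratio)
            unsafe_mask = score > threshold

            # Remove: revert the flagged deltas back to the safety-aligned base weights.
            pruned_delta = torch.where(unsafe_mask, torch.zeros_like(delta), delta)

            # Recalibrate: rescale the retained deltas so the overall update magnitude
            # stays close to that of the original fine-tuning update.
            scale = delta.norm() / (pruned_delta.norm() + 1e-8)
            repaired[name] = w_base + pruned_delta * scale
        return repaired

In practice the repaired state dict would be loaded back into the fine-tuned model, preserving task performance from the retained deltas while reverting the parameters judged most responsible for unsafe behavior.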

This work addresses a critical security challenge for organizations using customized LLMs, offering a way to benefit from domain-specific fine-tuning without introducing new vulnerabilities or harmful behaviors.

Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models
