T-Vaccine: Safeguarding LLMs Against Harmful Fine-Tuning

A targeted layer-wise defense approach for enhanced safety alignment

T-Vaccine introduces a more efficient approach to protecting large language models from harmful fine-tuning attacks, which manipulate models into generating dangerous content.

  • Uses targeted layer-wise perturbation instead of uniform perturbation across all layers
  • Reduces unnecessary perturbations in safety-irrelevant layers, improving overall performance
  • Decreases memory consumption while maintaining robust defense capabilities
  • Demonstrates superior protection against attacks that attempt to bypass safety guardrails
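The targeted idea above can be sketched in toy form. The snippet below is a minimal illustration, not the paper's implementation: it stands in an "LLM" with a list of small weight matrices, assumes hypothetical per-layer safety-relevance scores (e.g. gradient norms measured on alignment data), and applies a bounded perturbation only to the top-scoring layers, leaving safety-irrelevant layers untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": one weight matrix per layer (hypothetical stand-in for an LLM).
layers = [rng.standard_normal((4, 4)) for _ in range(6)]

# Hypothetical per-layer safety-relevance scores; values are illustrative only.
safety_scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])

def targeted_perturb(layers, scores, k=3, eps=0.05, rng=rng):
    """Perturb only the k most safety-relevant layers (the targeted
    layer-wise idea, in toy form); other layers are left unchanged."""
    top_k = set(np.argsort(scores)[-k:].tolist())  # safety-critical layers
    out = []
    for i, w in enumerate(layers):
        if i in top_k:
            noise = rng.standard_normal(w.shape)
            noise *= eps / np.linalg.norm(noise)   # bound perturbation to eps
            out.append(w + noise)
        else:
            out.append(w)                          # safety-irrelevant: untouched
    return out, top_k

perturbed, chosen = targeted_perturb(layers, safety_scores)
```

Because perturbations (and the associated bookkeeping) are computed only for the selected layers, the memory and compute savings over uniform all-layer perturbation fall out directly from this structure.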

This research advances LLM security by providing a resource-efficient defense mechanism that preserves model utility while preventing malicious manipulation, which is essential for the secure deployment of fine-tuning services.

Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
