T-Vaccine: Safeguarding LLMs Against Harmful Fine-Tuning

A targeted layer-wise defense approach for enhanced safety alignment

T-Vaccine introduces a more efficient approach to protecting large language models from harmful fine-tuning attacks, which manipulate models into generating dangerous content.

  • Uses targeted layer-wise perturbation instead of uniform perturbation across all layers
  • Reduces unnecessary perturbations in safety-irrelevant layers, improving overall performance
  • Decreases memory consumption while maintaining robust defense capabilities
  • Demonstrates superior protection against attacks that attempt to bypass safety guardrails
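The targeted idea above can be sketched in toy form. The snippet below is a minimal illustration, not the paper's implementation: it stands in an "LLM" with a list of small weight matrices, assumes hypothetical per-layer safety-relevance scores (e.g. gradient norms measured on alignment data), and applies a bounded perturbation only to the top-scoring layers, leaving safety-irrelevant layers untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": one weight matrix per layer (hypothetical stand-in for an LLM).
layers = [rng.standard_normal((4, 4)) for _ in range(6)]

# Hypothetical per-layer safety-relevance scores; values are illustrative only.
safety_scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])

def targeted_perturb(layers, scores, k=3, eps=0.05, rng=rng):
    """Perturb only the k most safety-relevant layers (the targeted
    layer-wise idea, in toy form); other layers are left unchanged."""
    top_k = set(np.argsort(scores)[-k:].tolist())  # safety-critical layers
    out = []
    for i, w in enumerate(layers):
        if i in top_k:
            noise = rng.standard_normal(w.shape)
            noise *= eps / np.linalg.norm(noise)   # bound perturbation to eps
            out.append(w + noise)
        else:
            out.append(w)                          # safety-irrelevant: untouched
    return out, top_k

perturbed, chosen = targeted_perturb(layers, safety_scores)
```

Because perturbations (and the associated bookkeeping) are computed only for the selected layers, the memory and compute savings over uniform all-layer perturbation fall out directly from this structure.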

This research advances LLM security by providing a resource-efficient defense mechanism that preserves model utility while preventing malicious manipulation, which is essential for the secure deployment of fine-tuning services.

Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
