
T-Vaccine: Safeguarding LLMs Against Harmful Fine-Tuning
A targeted layer-wise defense approach for enhanced safety alignment
T-Vaccine introduces a more resource-efficient approach to protecting large language models from harmful fine-tuning attacks, in which an attacker fine-tunes a model on malicious data to strip its safety alignment and make it generate dangerous content.
- Uses targeted layer-wise perturbation instead of uniform perturbation across all layers (a minimal sketch follows this list)
- Avoids unnecessary perturbation of safety-irrelevant layers, preserving model utility that uniform perturbation would degrade
- Decreases memory consumption while maintaining robust defense capabilities
- Demonstrates stronger protection than uniform-perturbation defenses against attacks that attempt to bypass safety guardrails
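To make the layer-targeting idea concrete, here is a minimal PyTorch sketch of the two steps the bullets describe: scoring layers to find safety-critical ones, then perturbing hidden states only in those layers during alignment training. This is an illustrative sketch, not the paper's implementation: the `model.model.layers` layout, the gradient-norm scoring rule, the top-k cutoff, and the helper names are assumptions, and the random noise stands in for the optimized worst-case perturbation a Vaccine-style defense actually computes.

```python
import torch

def rank_safety_critical_layers(model, alignment_batch, top_k=4):
    """Score each transformer block by the gradient norm of the alignment
    loss w.r.t. its parameters; keep the top-k as "safety-critical".
    (Hypothetical helper; the paper's exact scoring rule may differ.)"""
    model.zero_grad()
    loss = model(**alignment_batch).loss  # assumes a HuggingFace-style causal LM
    loss.backward()
    scores = []
    for idx, layer in enumerate(model.model.layers):  # assumes a LLaMA-style layout
        norm = sum(float(p.grad.norm()) for p in layer.parameters()
                   if p.grad is not None)
        scores.append((idx, norm))
    model.zero_grad()
    scores.sort(key=lambda s: s[1], reverse=True)
    return {idx for idx, _ in scores[:top_k]}

def perturb_critical_layers(model, critical_layers, rho=0.1):
    """Register forward hooks that add a norm-bounded perturbation to the
    hidden states of safety-critical layers only; all other layers run
    untouched, which is what saves compute and memory compared with
    perturbing every layer uniformly."""
    def make_hook():
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            noise = torch.randn_like(hidden)
            # Bound the perturbation per token; random noise here stands in
            # for the adversarially optimized perturbation used in practice.
            noise = rho * noise / noise.norm(dim=-1, keepdim=True).clamp_min(1e-8)
            perturbed = hidden + noise
            return (perturbed,) + output[1:] if isinstance(output, tuple) else perturbed
        return hook

    handles = []
    for idx, layer in enumerate(model.model.layers):
        if idx in critical_layers:
            handles.append(layer.register_forward_hook(make_hook()))
    return handles  # call h.remove() on each handle after alignment training
```

Under these assumptions, the alignment phase would rank the layers once (or periodically), register the hooks, run the alignment steps so the model learns representations that stay safe under perturbation in exactly those layers, and remove the hooks before serving.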
This research advances LLM security with a resource-efficient defense that preserves model utility while preventing malicious manipulation, a prerequisite for securely offering fine-tuning services.