
Multilingual Safety for AI Assistants
Precision-targeting language-specific vulnerabilities in LLMs
Soteria is a lightweight approach to improving LLM safety across multiple languages: for each language, it targets only the parameters most responsible for harmful outputs.
- Identifies and adjusts only the functional heads most responsible for generating harmful content
- Achieves significant safety improvements while modifying only a small fraction of the model's parameters
- Maintains model performance across languages, even in low-resource settings
- Introduces XThreatBench as a specialized benchmark for multilingual safety evaluation
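The idea of adjusting only the most responsible heads can be illustrated with a toy sketch. This is not Soteria's actual procedure: the per-head scores, the "harm direction" vectors, and the names `select_top_heads`, `steer_heads`, `fraction`, and `strength` are all hypothetical, chosen only to show how an edit can be confined to a small, language-specific subset of heads while every other parameter is left untouched.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical addressing scheme


def select_top_heads(scores: Dict[Head, float], fraction: float) -> List[Head]:
    """Return the highest-scoring heads, keeping about `fraction` of them (at least one)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda head: scores[head], reverse=True)
    return ranked[:k]


def steer_heads(
    weights: Dict[Head, Sequence[float]],
    harm_direction: Dict[Head, Sequence[float]],
    selected: Sequence[Head],
    strength: float,
) -> Dict[Head, List[float]]:
    """Subtract strength * harm_direction from the selected heads only.

    Heads outside `selected` are copied unchanged, so the edit touches
    only a fraction of the (toy) parameters.
    """
    chosen = set(selected)
    steered: Dict[Head, List[float]] = {}
    for head, w in weights.items():
        if head in chosen:
            d = harm_direction[head]
            steered[head] = [wi - strength * di for wi, di in zip(w, d)]
        else:
            steered[head] = list(w)
    return steered


# Toy data for one language: assumed scores, not measured values.
scores = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.3, (1, 1): 0.2}
weights = {head: [1.0, 1.0] for head in scores}
harm_direction = {head: [0.5, -0.5] for head in scores}

top = select_top_heads(scores, fraction=0.25)
steered = steer_heads(weights, harm_direction, top, strength=0.5)
```

In this sketch only head `(0, 1)` is modified; a real system would have to derive the scores and directions separately for each language, and how Soteria does so is not described in this summary.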
This research addresses a critical security challenge for deploying LLMs globally, enabling organizations to implement targeted safety controls without compromising overall model utility or requiring complete retraining.
Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment