Multilingual Safety for AI Assistants

Soteria is a lightweight approach to improving LLM safety across multiple languages: rather than retraining the whole model, it adjusts only the parameters most responsible for harmful outputs in each language.

Identifies and adjusts only the functional heads most responsible for generating harmful content
Achieves significant safety improvements while modifying only a small fraction of the model's parameters
Maintains model performance across languages, even in low-resource settings
Introduces XThreatBench as a specialized benchmark for multilingual safety evaluation
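
The selection-and-adjustment idea in the points above can be sketched in a few lines of Python. This is an illustrative toy, not Soteria's actual procedure: the per-head harm scores, the harm direction, the `fraction` and `strength` parameters, and both function names are assumptions introduced here, and each head's parameters are flattened to a plain list of floats.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical addressing scheme


def select_top_heads(scores: Dict[Head, float], fraction: float) -> List[Head]:
    """Return the highest-scoring heads for one language, keeping at least one."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda head: scores[head], reverse=True)
    return ranked[:k]


def steer_heads(
    weights: Dict[Head, List[float]],
    harm_direction: Dict[Head, List[float]],
    selected: List[Head],
    strength: float,
) -> Dict[Head, List[float]]:
    """Subtract a scaled harm direction from the selected heads only.

    All other heads are copied unchanged, so the edit touches only a
    fraction of the parameters. The input dictionary is not mutated.
    """
    steered = {head: list(w) for head, w in weights.items()}
    for head in selected:
        steered[head] = [
            w - strength * d
            for w, d in zip(weights[head], harm_direction[head])
        ]
    return steered


# Toy data for a single language (assumed values, not measured results).
scores = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.3, (1, 1): 0.2}
weights = {head: [1.0, 1.0] for head in scores}
harm_direction = {head: [0.5, -0.5] for head in scores}

selected = select_top_heads(scores, fraction=0.25)
steered = steer_heads(weights, harm_direction, selected, strength=1.0)
```

With these toy values only head `(0, 1)` is selected and changed; the other three heads keep their original weights. In a multilingual setting the same two steps would presumably run once per language, each with its own scores, which is what makes the adjustment language-specific.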

This research addresses a critical security challenge for deploying LLMs globally, enabling organizations to implement targeted safety controls without compromising overall model utility or requiring complete retraining.

Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment