Safety Across Languages: The Hidden Gap in LLM Alignment

Safety Across Languages: The Hidden Gap in LLM Alignment

How safety mechanisms transfer (or fail to transfer) across languages

This research reveals critical gaps in how LLM safety alignment, primarily developed in English, performs across multiple languages.

  • Current alignment tuning methods developed in English do not generalize equally to all languages
  • Researchers identified a distinct "safety space" within LLMs that constrains their outputs differently per language
  • Non-English languages often receive weaker safety constraints, creating security vulnerabilities
  • Findings suggest alignment methods need language-specific approaches rather than assuming English-based safety transfers universally

For security teams, this research highlights the importance of testing LLM safety in all deployment languages rather than assuming English safety evaluations are sufficient.

The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

17 | 20