Safety Risks in LLM Role-Play

Safety Risks in LLM Role-Play

Uncovering & addressing security vulnerabilities when LLMs assume character roles

This research identifies critical safety degradation that occurs when fine-tuning large language models for role-playing characters, especially villainous ones.

  • Models fine-tuned for role-playing show a 30.6% increase in harmful outputs
  • Villainous characters demonstrate 62.1% higher safety risks than non-villainous roles
  • Proposed Safety-Aware Role-Play Fine-Tuning (SaRFT) method reduces harmful outputs by 56.8% while maintaining role-playing capabilities
  • Created comprehensive benchmark using 95 role-specific LLMs for security testing

This research is vital for AI safety as role-playing features become mainstream in commercial AI applications, helping developers implement safeguards against potential misuse while preserving engaging user experiences.

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs

83 | 104