Breaking Language Barriers in Content Moderation

This research addresses the gap in offensive language detection between high-resource and low-resource languages by introducing adaptation strategies for Sinhala.

Key Innovations:

Introduction of four new models, including Subasa-XLM-R, which adds an intermediate pre-finetuning step
Successful adaptation of language models for a low-resourced language
Novel fine-tuning techniques specifically optimized for Sinhala
Enhanced detection capabilities for offensive content moderation

Business Impact: These advances enable more effective content moderation and social media safety systems for previously underserved linguistic communities, extending the reach of security applications across language barriers.

Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala