Multilingual Fact-Checking at Scale

Using LLMs to Generate High-Quality Training Data for Multiple Languages

MultiSynFact is a dataset of 2.2M claim-source pairs that extends fact-checking beyond English to Spanish, German, and low-resource languages.

Key innovations:

  • First large-scale multilingual fact-checking dataset with 2.2M claim-source pairs
  • Novel LLM-based data generation pipeline integrating Wikipedia knowledge
  • Supports fact-checking in Spanish, German, English, and low-resource languages
  • Addresses a critical security gap in multilingual misinformation detection
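
As a rough illustration of what an LLM-based generation step for claim-source pairs could look like, here is a minimal Python sketch. It is not the authors' pipeline: the record fields, the label names, the prompt wording, and the `generate` callable (a stand-in for any LLM call) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical label set; the summary above does not state which labels are used.
LABELS = ("supports", "refutes", "not_enough_info")


@dataclass(frozen=True)
class ClaimSourcePair:
    """One training example: a generated claim tied to its source passage."""
    claim: str
    source: str
    language: str
    label: str


def build_prompt(source: str, language: str, label: str) -> str:
    """Build an illustrative prompt asking for a claim with a given relation."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return (
        f"Source passage:\n{source}\n\n"
        f"Write one claim in {language} whose relation to the passage "
        f"is '{label}'."
    )


def make_pair(
    source: str,
    language: str,
    label: str,
    generate: Callable[[str], str],
) -> ClaimSourcePair:
    """Call the (assumed) text generator and wrap its output as a pair."""
    claim = generate(build_prompt(source, language, label)).strip()
    return ClaimSourcePair(claim=claim, source=source, language=language, label=label)
```

In use, `source` would be a passage such as a Wikipedia excerpt, and `generate` would wrap whichever model is available; passing a stub function in its place makes the step easy to test without a model.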

Security impact: By enabling robust multilingual fact-checking systems, this research provides tools for combating misinformation at global scale, which is critical for information security across language barriers.

Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking
