
Unlocking Multilingual Medical Data
Building Low-Resource Information Extraction for Clinical Cases
E3C-3.0 is a new multilingual medical dataset for information extraction from clinical cases, aimed at languages with limited annotated resources.
- Annotates diseases and test-result relations in five native languages (English, French, Italian, Spanish, Basque)
- Extends coverage to five additional languages by translating the texts and projecting the annotations onto the translations
- Employs a semi-automatic approach with LLM-based annotation projection
- Facilitates cross-lingual medical information extraction in resource-constrained settings
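To make the projection step concrete, here is a minimal Python sketch of the general idea, not the authors' pipeline. It assumes that some external component (an LLM or a human translator) has already proposed the target-language string that corresponds to an annotated source mention. The `Span` class, the `project_span` function, and the `DISEASE` label are hypothetical names introduced only for illustration. The sketch only locates that proposed string in the translated text and carries the label over, returning `None` when the string is not found so the case can be sent to manual review.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """A labelled character span (hypothetical structure, for illustration)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def project_span(source_span: Span, target_text: str,
                 target_mention: str) -> Optional[Span]:
    """Carry a source annotation over to the translated text.

    `target_mention` is the string proposed (e.g. by an LLM) as the
    translation of the annotated source mention. It is an assumed input;
    no model is called here.
    """
    if not target_mention:
        return None
    # Case-insensitive literal search; offsets refer to the original target_text.
    match = re.search(re.escape(target_mention), target_text, re.IGNORECASE)
    if match is None:
        return None  # not found: route to manual review
    return Span(match.start(), match.end(), source_span.label)


# Usage with made-up example sentences.
source = "The patient was diagnosed with pneumonia."
start = source.index("pneumonia")
source_span = Span(start, start + len("pneumonia"), "DISEASE")

target = "Al paziente è stata diagnosticata una polmonite."
projected = project_span(source_span, target, "polmonite")
missing = project_span(source_span, target, "bronchite")
```

Returning `None` instead of guessing a span keeps the process semi-automatic: only mentions that can be located verbatim are accepted automatically, and everything else is left for a human annotator.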
By providing a standardized multilingual dataset, this research addresses a critical gap in medical NLP: it supports the development of clinical information systems for underrepresented languages and broader access to healthcare information.
Low-resource Information Extraction with the European Clinical Case Corpus