Unlocking Multilingual Medical Data

E3C-3.0 is a new multilingual medical dataset that supports information extraction from clinical cases in several languages, including low-resource ones.

Annotates diseases and test-result relations in clinical cases written natively in five languages (English, French, Italian, Spanish, Basque)
Extends coverage to five target languages by translating the texts and projecting the annotations onto the translations
Employs a semi-automatic approach with LLM-based annotation projection
Facilitates cross-lingual medical information extraction in resource-constrained settings
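
To make the idea of annotation projection concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the `Span` type, the `project_span` function, and the `toy_propose` lookup are all hypothetical names introduced for illustration, and `toy_propose` merely stands in for an LLM call that would propose the translated mention. The sketch only shows the general shape of the task: take an annotated mention in the source text, obtain a candidate mention in the translation, and accept it only if it can be located verbatim in the target text; anything else is left for human revision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Span:
    """Character offsets of an annotated mention, plus its label."""
    start: int
    end: int
    label: str


def project_span(
    source_text: str,
    source_span: Span,
    target_text: str,
    propose: Callable[[str, str], Optional[str]],
) -> Optional[Span]:
    """Project one annotated span from source_text onto target_text.

    `propose` is a hypothetical stand-in for an LLM: given the source
    mention and the target text, it returns the target mention as a string.
    Returns None when the proposal cannot be found verbatim in the target,
    which signals that a human should review the case.
    """
    mention = source_text[source_span.start:source_span.end]
    candidate = propose(mention, target_text)
    if not candidate:
        return None
    start = target_text.find(candidate)
    if start == -1:
        return None
    return Span(start, start + len(candidate), source_span.label)


def toy_propose(mention: str, target_text: str) -> Optional[str]:
    """Toy replacement for an LLM: a one-entry English-to-Italian lexicon."""
    lexicon = {"pneumonia": "polmonite"}
    return lexicon.get(mention.lower())


source = "The patient was diagnosed with pneumonia."
target = "Al paziente è stata diagnosticata una polmonite."

start = source.index("pneumonia")
src_span = Span(start, start + len("pneumonia"), "DISEASE")

projected = project_span(source, src_span, target, toy_propose)
unprojected = project_span(source, Span(4, 11, "OTHER"), target, toy_propose)
```

Here `projected` carries the offsets of "polmonite" in the Italian sentence with the original `DISEASE` label, while `unprojected` is `None` because the toy lexicon has no entry for "patient". In a semi-automatic setting, such `None` results are the cases that would be routed to a human annotator.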

This research addresses a gap in medical NLP by providing a consistently annotated multilingual dataset. It supports the development of clinical information extraction systems for underrepresented languages and, in turn, broader access to healthcare information.

Low-resource Information Extraction with the European Clinical Case Corpus