
SynSUM: Bridging the Medical Data Gap
A synthetic benchmark connecting clinical notes with structured patient data
SynSUM addresses the scarcity of paired structured and unstructured medical data by generating 10,000 synthetic patient records in the respiratory disease domain.
- Dual-format dataset combining tabular medical variables with corresponding clinical notes
- Tabular variables are sampled from a Bayesian network, which keeps the relationships between medical variables realistic
- Focus on respiratory diseases with symptoms, diagnoses, and underlying conditions
- Enables development of clinical information extraction and reasoning systems without privacy concerns
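To make the generation idea in the list above concrete, here is a minimal sketch of ancestral sampling from a tiny Bayesian network, with each sampled row paired with a short text note. Everything specific in it is an assumption for illustration: the three variables, the probabilities, and the function names `sample_record`, `placeholder_note`, and `generate_dataset` are hypothetical and are not taken from SynSUM. The templated note is only a stand-in for however the real clinical notes are produced.

```python
import random

# Hypothetical toy network: pneumonia -> fever, pneumonia -> cough.
# All variable names and probabilities are invented for illustration;
# they are NOT the SynSUM network or its parameters.
P_PNEUMONIA = 0.05
P_FEVER_GIVEN = {True: 0.80, False: 0.10}   # P(fever | pneumonia)
P_COUGH_GIVEN = {True: 0.90, False: 0.20}   # P(cough | pneumonia)


def sample_record(rng: random.Random) -> dict:
    """Ancestral sampling: draw the parent first, then each child given it."""
    pneumonia = rng.random() < P_PNEUMONIA
    fever = rng.random() < P_FEVER_GIVEN[pneumonia]
    cough = rng.random() < P_COUGH_GIVEN[pneumonia]
    return {"pneumonia": pneumonia, "fever": fever, "cough": cough}


def placeholder_note(record: dict) -> str:
    """Templated stand-in for a clinical note (an assumption, not SynSUM's method)."""
    symptoms = [name for name in ("fever", "cough") if record[name]]
    if not symptoms:
        return "Patient reports no fever or cough."
    return "Patient presents with " + " and ".join(symptoms) + "."


def generate_dataset(n: int, seed: int = 0) -> list[dict]:
    """Build n paired records: structured variables plus a matching note."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        row = sample_record(rng)
        records.append({"tabular": row, "note": placeholder_note(row)})
    return records


if __name__ == "__main__":
    data = generate_dataset(10_000)
    with_pneumonia = [d for d in data if d["tabular"]["pneumonia"]]
    fever_rate = sum(d["tabular"]["fever"] for d in with_pneumonia) / len(with_pneumonia)
    print(f"{len(data)} records, fever rate given pneumonia: {fever_rate:.2f}")
```

Because every note is derived from its own sampled row, the structured values can serve as ground-truth labels when evaluating an information extraction system on the text.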
This benchmark addresses a critical need in medical AI development, where privacy regulations limit access to real patient data. It lets researchers test and improve clinical NLP systems on realistic but fully synthetic records.
SynSUM — Synthetic Benchmark with Structured and Unstructured Medical Records