
Data Quality's Critical Impact on Healthcare AI
How textual data errors affect ML model performance in medical settings
This research quantifies how much errors in medical text data degrade machine learning model performance and the quality of learned feature representations.
- Error rate metrics were developed to evaluate textual data quality at the token level
- The Mixtral LLM was used to detect and correct errors in low-quality medical datasets
- Embeddings analysis revealed that data quality directly impacts feature representation and downstream ML model performance
- Correction strategies demonstrated improved model performance, particularly for healthcare datasets
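
To make the idea of a token-level error rate concrete, here is a minimal Python sketch. The summary does not give the paper's exact metric, so this is an assumption: the rate is taken as the fraction of tokens in the original text that do not survive unchanged in a corrected version (for example, one produced by an LLM). The function names `tokenize` and `token_error_rate`, the whitespace tokenizer, and the sample sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def tokenize(text):
    # Assumption: plain whitespace tokenization. A real pipeline would
    # likely use a tokenizer suited to clinical text.
    return text.split()


def token_error_rate(original, corrected):
    """Fraction of original tokens that were changed or removed by correction."""
    orig_tokens = tokenize(original)
    corr_tokens = tokenize(corrected)
    if not orig_tokens:
        return 0.0
    # Align the two token sequences and count tokens that match in order.
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(orig_tokens)


# Hypothetical example: two misspelled tokens out of six.
noisy = "pateint reports chest pian since yesterday"
clean = "patient reports chest pain since yesterday"
print(f"{token_error_rate(noisy, clean):.3f}")
```

A rate computed this way depends on how trustworthy the corrected text is, so it is better read as an estimate of data quality than as a ground-truth measurement.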
For healthcare organizations: This research highlights the importance of implementing data quality controls before deploying ML systems on patient data, which could improve clinical decision support and reduce errors.