Data Quality's Critical Impact on Healthcare AI

This research quantifies how errors in medical text data significantly impair machine learning model performance and feature representation quality.

Error rate metrics were developed to evaluate textual data quality at the token level.
The Mixtral LLM was used to detect and correct errors in low-quality medical datasets.
Embedding analysis revealed that data quality directly affects feature representation and downstream ML model performance.
Correction strategies improved model performance, particularly on healthcare datasets.
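
To make the idea of a token-level error rate concrete, here is a minimal Python sketch. It is not the paper's metric, whose exact definition is not given in this summary. It assumes whitespace tokenization and that a corrected version of the text is available (for example, from an LLM correction pass), and it counts an error wherever the aligned tokens differ. The function name `token_error_rate` is hypothetical.

```python
from difflib import SequenceMatcher


def token_error_rate(original: str, corrected: str) -> float:
    """Share of tokens in `original` that differ from `corrected`.

    Assumption: tokens are whitespace-separated, and every non-matching
    aligned span between the two token lists counts as an error.
    """
    orig_tokens = original.split()
    corr_tokens = corrected.split()
    if not orig_tokens:
        return 0.0

    # Align the two token sequences; autojunk is disabled so that
    # frequent tokens are never dropped from the alignment.
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)

    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, deleted, or inserted span counts once per token
            # on its longer side.
            errors += max(i2 - i1, j2 - j1)

    return errors / len(orig_tokens)


# Two of the seven tokens are misspelled in the original note.
example_rate = token_error_rate(
    "pt has hx of diabtes and hypertnsion",
    "pt has hx of diabetes and hypertension",
)
```

A rate like this could be averaged over a dataset to compare its quality before and after correction. Because insertions are counted against the original length, the value can exceed 1.0 for heavily rewritten text.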

For healthcare organizations: This research highlights the importance of putting data quality controls in place before deploying ML systems on patient data, a step that could improve clinical decision support and reduce errors.

Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models