Data Quality's Critical Impact on Healthcare AI

How textual data errors affect ML model performance in medical settings

This research quantifies how errors in medical text data degrade machine learning model performance and the quality of feature representations.

  • Error rate metrics were developed to evaluate textual data quality at the token level
  • Mixtral LLM was used to detect and correct errors in low-quality medical datasets
  • Embedding analysis revealed that data quality directly impacts feature representation and downstream ML model performance
  • Correction strategies demonstrated improved model performance, particularly for healthcare datasets
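
To make the idea of a token-level error rate concrete, here is a minimal sketch in Python. It is not the paper's metric, whose exact definition is not given in this summary. It assumes whitespace tokenization and assumes that a corrected version of each text is available (for example, from an LLM correction pass). The function name `token_error_rate` is hypothetical.

```python
from difflib import SequenceMatcher


def token_error_rate(original: str, corrected: str) -> float:
    """Fraction of tokens in `original` that have no match in `corrected`.

    Assumption: tokens are whitespace-separated, and `corrected` is a
    trusted reference version of the same text.
    """
    orig_tokens = original.split()
    corr_tokens = corrected.split()
    if not orig_tokens:
        # An empty text has no tokens that can be wrong.
        return 0.0
    # Align the two token sequences; autojunk is disabled so that
    # frequent tokens are not dropped from the alignment.
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(orig_tokens) - matched) / len(orig_tokens)
```

For example, `token_error_rate("patient has feever and cough", "patient has fever and cough")` gives 0.2, since one of the five original tokens has no match in the corrected text. Averaging this value over a dataset would give one possible dataset-level quality score under the same assumptions.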

For healthcare organizations: These findings underline the importance of implementing data quality controls before deploying ML systems on patient data. Cleaner input text could improve clinical decision support and reduce errors.

Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models
