
LLM Confidence in Medical Diagnosis
Evaluating AI Reliability in Gastroenterology
This study evaluates the self-reported confidence of leading large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, Qwen) on gastroenterology assessments.
- Top models (GPT-o1, GPT-4o, Claude-3.5-Sonnet) achieved Brier scores of 0.15 to 0.20 (lower is better; 0 means every answer was predicted with full and correct confidence)
- Newer models perform better than older ones, but all models remain overconfident
- Uncertainty estimation remains a significant challenge for medical applications
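
For readers unfamiliar with the metric, the sketch below shows how a Brier score and a simple overconfidence gap could be computed from a model's stated confidence and whether each answer was correct. It is a minimal illustration, not the study's evaluation code: the function names `brier_score` and `overconfidence_gap` and the sample values are hypothetical, and it assumes confidence is expressed as a probability between 0 and 1 with correctness coded as 0 or 1.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between stated confidence and the 0/1 outcome.

    0.0 is the best possible score; always answering with 0.5 confidence
    yields 0.25 regardless of accuracy.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(c - y) ** 2 for c, y in zip(confidences, correct)]
    return sum(squared_errors) / len(squared_errors)


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean stated confidence minus observed accuracy.

    A positive value means the model claimed more certainty than its
    accuracy justified.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


if __name__ == "__main__":
    # Hypothetical data: four questions, stated confidence and correctness.
    stated = [0.9, 0.8, 0.9, 0.7]
    was_correct = [1, 0, 1, 0]
    print("Brier score:", round(brier_score(stated, was_correct), 4))
    print("Overconfidence gap:", round(overconfidence_gap(stated, was_correct), 4))
```

A model can post a respectable Brier score while still showing a positive gap, which is why the two quantities are worth reporting side by side.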
This research highlights critical concerns for clinical implementation of LLMs, emphasizing the need for improved calibration and reliability assessment before healthcare deployment.