LLM Confidence in Medical Diagnosis

This study evaluates the self-reported confidence of leading large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, Qwen) on gastroenterology assessments.

Top models (GPT-o1, GPT-4o, Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2, where lower is better and 0 indicates perfect calibration and accuracy.
Newer models show improved performance, but all models remain overconfident.
Uncertainty estimation remains a significant challenge for medical applications.
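
To make the Brier score figures above concrete, here is a minimal Python sketch of how such a score can be computed from a model's stated confidence and whether each answer was correct. The function names (`brier_score`, `overconfidence_gap`) and the sample values are illustrative assumptions, not data or code from the study, and the study's exact scoring procedure may differ.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between stated confidence (0-1) and the 0/1 outcome."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(confidences, correct)) / len(confidences)


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean stated confidence minus accuracy; a positive value suggests overconfidence."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


if __name__ == "__main__":
    # Hypothetical example: four answers with self-reported confidence,
    # and 1/0 flags for whether each answer was correct.
    stated = [0.9, 0.8, 0.95, 0.7]
    was_correct = [1, 0, 1, 1]
    print("Brier score:", brier_score(stated, was_correct))
    print("Overconfidence gap:", overconfidence_gap(stated, was_correct))
```

As a reference point, a model that always reports 50% confidence scores 0.25 on this metric regardless of its answers, so a score is only informative to the extent that it falls below that baseline.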

This research highlights critical concerns for clinical implementation of LLMs, emphasizing the need for improved calibration and reliability assessment before healthcare deployment.

Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models