LLM Confidence in Medical Diagnosis

Evaluating AI Reliability in Gastroenterology

This study evaluates the self-reported confidence of leading large language model families (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, Qwen) on gastroenterology assessments.

  • Top models (GPT-o1, GPT-4o, Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 (lower is better)
  • Newer models perform better, but all models remain overconfident
  • Uncertainty estimation remains a significant challenge for medical applications
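
For readers unfamiliar with the metric, the sketch below shows the standard binary Brier score: the mean squared gap between a model's stated confidence and whether its answer was actually correct. It is an illustration only; the function names and the toy numbers are hypothetical, and the study's exact scoring procedure is not described in this summary.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared difference between stated confidence (0..1) and outcome (0 or 1)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")
    return sum((p - float(y)) ** 2 for p, y in zip(confidences, correct)) / len(confidences)


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean stated confidence minus observed accuracy; positive means overconfident."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for y in correct if y) / len(correct)
    return mean_confidence - accuracy


# Hypothetical toy data: a model that always reports 90% confidence
# but answers only half of the questions correctly.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, False, False]

toy_brier = brier_score(toy_confidences, toy_correct)
toy_gap = overconfidence_gap(toy_confidences, toy_correct)
```

On this toy data the Brier score is about 0.41 and the overconfidence gap is about 0.4. As a point of reference, a model that always reports 50% confidence scores exactly 0.25 regardless of its accuracy, so the reported range of 0.15-0.2 is better than that uninformative baseline but still well short of a perfect score of 0.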

This research highlights critical concerns for clinical implementation of LLMs, emphasizing the need for improved calibration and reliability assessment before healthcare deployment.

Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models
