The Illusion of Medical AI Competence

Why LLMs excel at multiple-choice but struggle with open-ended medical questions

This research reveals a significant gap between how Large Language Models perform on medical multiple-choice questions and how they perform on equivalent free-response questions.

  • LLMs show artificially inflated performance on multiple-choice medical tests compared to free-response versions
  • Models exploit structural patterns and eliminate unlikely options rather than demonstrating true medical knowledge
  • Even advanced models like GPT-4o show a dramatic 37% performance drop when question format changes
  • The study introduces FreeMedQA, a novel benchmark pairing MCQs with equivalent free-response questions

For medical education and clinical applications, these findings challenge the validity of using multiple-choice tests to evaluate AI medical capabilities, suggesting caution when deploying LLMs in healthcare settings.

It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education
