
The Illusion of Medical AI Competence
Why LLMs excel at multiple-choice but struggle with open-ended medical questions
This research reveals a significant gap between how Large Language Models (LLMs) perform on medical multiple-choice questions (MCQs) and how they perform on free-response versions of the same questions.
- LLMs show artificially inflated performance on multiple-choice medical tests compared to free-response versions
- Models exploit structural patterns and eliminate unlikely options rather than demonstrating true medical knowledge
- Even advanced models such as GPT-4o show a 37% performance drop when the question format changes from multiple-choice to free-response
- The study introduces FreeMedQA, a novel benchmark pairing MCQs with equivalent free-response questions
For medical education and clinical applications, these findings challenge the validity of using multiple-choice tests to evaluate AI medical capabilities, suggesting caution when deploying LLMs in healthcare settings.
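The format gap described above can be expressed as a simple comparison of accuracy on paired items. The summary does not say whether the reported drop is absolute (percentage points) or relative (fraction of the multiple-choice score), so the minimal sketch below computes both. The function names and the per-item results are hypothetical and purely illustrative; they are not taken from the study or from FreeMedQA.

```python
from typing import Dict, List


def accuracy(results: List[bool]) -> float:
    """Return the fraction of items marked correct."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def format_gap(mcq_correct: List[bool], free_correct: List[bool]) -> Dict[str, float]:
    """Compare accuracy on paired MCQ and free-response items.

    Both lists must be aligned: index i refers to the same underlying
    question asked in each format.
    """
    if len(mcq_correct) != len(free_correct):
        raise ValueError("paired lists must have the same length")

    mcq_acc = accuracy(mcq_correct)
    free_acc = accuracy(free_correct)
    absolute_drop = mcq_acc - free_acc
    # Relative drop is undefined when MCQ accuracy is zero; report 0.0 then.
    relative_drop = absolute_drop / mcq_acc if mcq_acc > 0 else 0.0

    return {
        "mcq_accuracy": mcq_acc,
        "free_accuracy": free_acc,
        "absolute_drop": absolute_drop,
        "relative_drop": relative_drop,
    }


if __name__ == "__main__":
    # Made-up grading results for ten paired questions (not study data).
    mcq = [True, True, True, True, True, True, True, True, False, False]
    free = [True, True, True, True, True, False, False, False, False, False]
    for name, value in format_gap(mcq, free).items():
        print(f"{name}: {value:.3f}")
```

With the same paired items, the two measures can differ noticeably, which is why a reported drop is easier to interpret when its definition is stated alongside it.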
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education