The Illusion of Medical AI Competence

This research reveals a significant gap between how Large Language Models (LLMs) perform on medical multiple-choice questions (MCQs) and how they perform on free-response versions of the same questions.

LLMs show artificially inflated performance on multiple-choice medical tests compared to free-response versions.
Models exploit structural patterns and eliminate unlikely options rather than demonstrate true medical knowledge.
Even advanced models like GPT-4o show a dramatic 37% performance drop when the question format changes from multiple-choice to free-response.
The study introduces FreeMedQA, a novel benchmark pairing multiple-choice questions with equivalent free-response questions.

For medical education and clinical applications, these findings challenge the validity of using multiple-choice tests to evaluate the medical capabilities of AI, and they suggest caution when deploying LLMs in healthcare settings.

It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education