
Evaluating Medical Accuracy in LLMs
Isolating factual medical knowledge from reasoning capabilities
This research evaluates LLMs' ability to recall and apply factual medical knowledge using structured one-hop judgment tasks built from the Unified Medical Language System (UMLS).
- Created a specialized benchmark that isolates factual knowledge recall from reasoning ability (a construction sketch follows this list)
- Evaluated multiple LLMs, including GPT-4, Claude, and Llama models
- Found a significant gap between human expert performance and that of even the most advanced LLMs
- Showed that LLMs still struggle with factual precision in high-stakes medical domains
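
To make the benchmark construction concrete, here is a minimal sketch of how one-hop judgment items might be built from UMLS-style (head, relation, tail) triples: each true triple becomes a True statement, and a corrupted-tail variant becomes a paired False statement. The triples, relation names, and prompt wording below are illustrative assumptions, not the actual benchmark data or pipeline.

```python
import random

# UMLS-style triples: (head concept, relation, tail concept).
# These concepts and relations are illustrative placeholders, not
# items from the actual benchmark.
TRIPLES = [
    ("Metformin", "may_treat", "Type 2 diabetes mellitus"),
    ("Warfarin", "may_treat", "Atrial fibrillation"),
    ("Amoxicillin", "has_contraindication", "Penicillin allergy"),
]

TAILS = [tail for _, _, tail in TRIPLES]


def make_judgment_items(triples, rng):
    """Turn each triple into a pair of one-hop judgments: the original
    triple (label True) and a corrupted-tail negative (label False)."""
    items = []
    for head, rel, tail in triples:
        relation_text = rel.replace("_", " ")
        items.append((f"{head} {relation_text} {tail}.", True))
        # Negative example: swap in a tail concept from another triple.
        wrong_tail = rng.choice([t for t in TAILS if t != tail])
        items.append((f"{head} {relation_text} {wrong_tail}.", False))
    return items


def to_prompt(statement):
    """Wrap a statement in a binary-judgment prompt for an LLM."""
    return (
        "Answer with exactly 'True' or 'False'.\n"
        f"Statement: {statement}\n"
        "Answer:"
    )


if __name__ == "__main__":
    rng = random.Random(0)
    for statement, label in make_judgment_items(TRIPLES, rng):
        print(f"[{label}] {statement}")
```

Accuracy on such paired items measures factual recall directly: because each statement requires only a single known relation, a model that reasons well but lacks the underlying fact cannot recover the answer from the statement itself.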
These results matter for medical applications because inaccurate information in healthcare settings can have serious consequences for patient safety. The findings underscore the need for careful validation before deploying LLMs in clinical decision support.