Evaluating LLMs in Traditional Chinese Medicine

TCM-3CEval is a triaxial benchmark that evaluates large language models' capabilities in Traditional Chinese Medicine (TCM) along three dimensions.

Assesses core knowledge mastery, classical text understanding, and clinical decision-making in TCM
Evaluates a range of models, including international LLMs (e.g., GPT-4o), Chinese LLMs (e.g., InternLM), and medical-specific models (e.g., PLUSE)
Reveals a clear performance hierarchy among different model types

This work matters for medical AI deployment because it addresses a gap in how LLMs are evaluated for traditional medicine, which could improve healthcare access and help preserve TCM knowledge.

TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine