Evaluating Health Information in Chinese LLMs

CHBench is the first safety-focused benchmark for evaluating how Large Language Models (LLMs) respond to Chinese health-related inquiries, addressing a critical gap in LLM assessment.

Comprehensively assesses how Chinese LLMs handle physical and mental health topics
Identifies potential risks of medical misinformation in AI responses
Provides a standardized framework for measuring the safety of health-related responses from Chinese language models
Emphasizes the real-world consequences of inaccurate health information in AI systems

This research is vital for the responsible deployment of AI in healthcare contexts, where incorrect information could lead to serious patient harm. It matters particularly in Chinese-speaking regions, where specialized evaluation tools have been lacking.

CHBench: A Chinese Dataset for Evaluating Health in Large Language Models