Evaluation and Performance Assessment of LLMs in Healthcare
Research evaluating the capabilities and limitations of LLMs in medical contexts

ChatGPT's Untapped Potential in Healthcare
A comprehensive analysis of capabilities, limitations, and future medical applications

MedAlpaca: Open-Source Medical AI for Healthcare
Privacy-Preserving Medical Language Models & Training Data

LLMs in Healthcare: Promise and Responsibility
Navigating the landscape of AI-powered medical solutions

LLMs Revolutionizing Bioinformatics
Transforming biological data analysis beyond natural language processing

AI-Powered Therapeutic Alliance Assessment
Automating alliance measurement in psychotherapy through language analysis

AI Transforming Thyroid Cancer Diagnosis
Leveraging Transformers and Machine Learning for Improved Clinical Outcomes

Right-Sizing AI for Medical Records
Medium-sized language models offer practical alternatives to LLMs in healthcare

LLMs in Public Health: Performance Evaluation
Comprehensive assessment of AI models for health classification and data extraction

CCoE: Multi-Expert Collaboration for Efficient LLMs
Maximizing LLM capabilities in resource-constrained environments

Health Information: Search Engines vs. LLMs
Comparing information sources for accurate health answers

Smarter LLMs Through Reranking
Using communication theory to reduce hallucinations and improve output quality

Beyond Academia: Testing LLMs on Real Professional Exams
Evaluating AI language models on vocational and professional certification standards

Cost-Effective Medical AI with Open-Source LLMs
Achieving premium AI healthcare solutions at a fraction of the cost

Evaluating Health Information in Chinese LLMs
First comprehensive benchmark for safety assessment in Chinese healthcare AI

Combating Bias in Medical AI
A scalable framework for evaluating bias in medical LLMs

LLMs for RNA Structure Prediction
Benchmarking language models for critical RNA biology applications

Evaluating LLMs in Long-Context Scenarios
New benchmark reveals gaps in how we test model information retention

Confidence in LLM Rankings
A statistical framework to assess uncertainty when evaluating AI models

Smarter AI Vision: Measuring Uncertainty
Enhancing VLMs with probabilistic reasoning for safer AI applications

Benchmarking LLMs in Pediatric Care
First comprehensive Chinese pediatric dataset for evaluating medical AI

Bridging the Vision Gap
Evaluating How MLLMs See Compared to Humans

Detecting Hallucinations in AI Radiology
Fine-grained approach for safer AI-generated medical reports

Combating Medical AI Hallucinations
A new benchmark to evaluate and reduce false information in medical AI systems

Combating LVLM Hallucinations
A new benchmark for detecting AI visual falsehoods

Mapping the Frontiers of LLMs
Understanding the Capabilities and Limitations of Large Language Models

Can LLMs Replace Human Annotators?
A Statistical Framework for Validating AI Judges

MedAgentBench: Virtual EHR Testing Ground for LLMs
First standardized benchmark for evaluating medical LLM agents in realistic healthcare environments

Evaluating AI Across Industries
A comprehensive benchmark for multimodal AI in industrial settings

Fair Pricing for LLM Training Data
A data valuation framework that ensures equitable compensation for data contributors

Benchmarking LLMs in Ophthalmology
A comprehensive evaluation framework for Chinese eye care applications

LLMs as Sentiment Analysis Experts
Evaluating AI accuracy in tobacco product sentiment analysis

Evaluating AI Chatbots for Menopause Support
A mixed-methods approach to assessing medical accuracy and reliability

Beyond Yes/No: Rethinking Healthcare LLM Evaluation
A comprehensive approach to assessing medical AI assistants

LLMs and Medical Misinformation
How language models interpret spin in clinical research

The Sycophancy Problem in AI
LLMs Prioritize Agreement Over Accuracy

Theory of Mind in AI Systems
How LLMs Understand Human Mental States

On-Device LLMs for Medical Privacy
Evaluating edge computing models for clinical reasoning without cloud dependence

Hallucination-Free AI: Comparing RAG, LoRA & DoRA
Comprehensive accuracy evaluation across critical domains

Healthcare's AI Revolution: LLMs in Medicine
Systematic review of training, customization, and evaluation techniques

Improving LLM Confidence Evaluation
A new benchmark for accurately assessing when AI systems should trust their own outputs

Evaluating Medical Accuracy in LLMs
Isolating factual medical knowledge from reasoning capabilities

Detecting Medical Hallucinations in LLMs
First benchmark to evaluate medical misinformation in AI models

Combating LLM Hallucinations
Beyond Self-Consistency: A Cross-Model Verification Approach

GraphCheck: Enhancing LLM Accuracy in Critical Applications
Knowledge Graph-Based Fact-Checking for Medical Content

Multi-Dimensional Uncertainty in LLMs
Beyond Semantic Similarity for More Reliable AI Systems

Transforming Healthcare with AI
How Medical Large Models Are Revolutionizing Patient Care

DeepSeek-R1 Leads in Medical AI Reasoning
Outperforming Gemini and OpenAI models in bilingual ophthalmology tests

Improving LLM Reliability in Social Sciences
Applying survey methodology to enhance AI text annotation

Fact-Checking Vision-Language Models
A statistical framework for reducing hallucinations in AI image interpretation

DeepSeek Models in Biomedical NLP
Evaluating cutting-edge LLMs for specialized medical text analysis

LLMs Revolutionizing Healthcare Text Classification
Systematic Review of AI Applications in Clinical Documentation

Benchmarking LLMs for Psychiatric Practice
Evaluating AI's potential in mental healthcare through comprehensive assessment

Combating Hallucinations in Medical AI
A systematic benchmark for evaluating and mitigating medical LVLM hallucinations

Enhancing Medical AI with Structured Outputs
How structured formats transform LLMs into reliable medical experts

Evaluating LLM Reasoning in Clinical Settings
New benchmark reveals how AI models perform on real medical cases

Detecting AI Hallucinations with Semantic Clustering
A novel uncertainty-based framework for identifying factual inaccuracies in LLMs

Evaluating LLMs in Traditional Chinese Medicine
The first comprehensive benchmark for assessing AI models in TCM contexts

Improving Medical AI Accuracy
A Framework for Understanding and Fixing LLM Errors in Healthcare

Rethinking Medical LLM Benchmarks
Moving beyond leaderboard competition to meaningful clinical evaluation

Smarter LLM Evaluation Methods
Making comprehensive AI assessment faster and more reliable

The Illusion of Medical AI Competence
Why LLMs excel at multiple-choice but struggle with open-ended medical questions

CURIE: Pushing the Boundaries of Scientific AI
Evaluating LLMs on long scientific contexts across multiple disciplines

LLM Battle: Llama 3 vs. DeepSeek-R1 in Medical Text Analysis
Comparing open-source LLMs on biomedical classification tasks

The Rise of Large Language Models
How ChatGPT is transforming industries

LLMs as Medical Assistants
A comprehensive benchmark for evaluating LLMs in primary healthcare

LLM Confidence in Medical Diagnosis
Evaluating AI Reliability in Gastroenterology

Combating Medical AI Hallucinations
Vision-Enhanced Detection System for Medical Visual Q&A

Consistency Matters: LLMs in Sequential Interactions
New metrics and methods to ensure reliable AI responses over multiple turns

When LLMs Get Medical Advice Wrong
How user inputs compromise AI reliability in healthcare

Expanding AI's Potential with Verifiable Rewards
How RLVR Extends LLM Performance Beyond Coding to Real-World Domains

DeepSeek R1's Clinical Reasoning Capabilities
93% Diagnostic Accuracy in Clinical Case Evaluation

RECKON: Revolutionizing LLM Knowledge Evaluation
A reference-based approach for efficient, scalable assessment

The Distracted Doctor Problem
How noise and irrelevant information impair medical LLMs

Advancing LLM Capabilities in Specialized Medicine
Systematic evaluation of AI reasoning in anesthesiology

Taming Uncertainty in LLM Sentiment Analysis
Addressing variability challenges for more reliable AI decisions

Mastering Multi-Turn LLM Interactions
Moving beyond single-turn capabilities for real-world applications

OrderChain: Enhancing Visual Reasoning in AI Models
A novel prompting approach that dramatically improves ordinal classification in multimodal models

Embracing Uncertainty in Medical AI
Rethinking how LLMs communicate uncertainty in healthcare

Rethinking Personality Assessment for LLMs
Moving Beyond Self-Reports with Multi-Observer Framework

Detecting Medical Hallucinations in AI
MedHal: A breakthrough dataset for evaluating hallucination detection in medical contexts

LLMs in Medicine: Diagnosis & Treatment Support
Evaluating AI models on real-world medical certification exams

Next-Gen LLMs in Ophthalmology
Head-to-head evaluation reveals performance gaps in medical reasoning
