
Evaluating LLMs in Long-Context Scenarios
New benchmark reveals gaps in how we test model information retention
The ETHIC benchmark provides a rigorous method for evaluating how well large language models genuinely comprehend and use information spread across long texts.
- Introduces the Information Coverage Rate (ICR) to quantify how much of the input context a task requires a model to actually use (a hedged sketch of the idea follows this list)
- Evaluates models across four domains: medicine, law, education, and linguistics
- Reveals that existing evaluation methods, many of which require only a small portion of the provided context, may overestimate model performance on long-context tasks
- Shows substantial performance gaps between standard and specialized models in domain-specific scenarios
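To make the metric concrete, here is a minimal sketch of an ICR-style computation. It is an illustration under assumptions, not the paper's exact procedure: it treats ICR as the fraction of context segments the reference answer depends on, and `needed_indices` stands in for whatever relevance-annotation process the benchmark actually uses.

```python
from typing import List, Set

def information_coverage_rate(context_segments: List[str],
                              needed_indices: Set[int]) -> float:
    """Illustrative ICR: the fraction of context segments whose
    information is required to answer the query.

    Assumption: `needed_indices` comes from human or automated
    annotation of segment relevance; ETHIC's actual procedure
    may differ.
    """
    if not context_segments:
        raise ValueError("context must contain at least one segment")
    return len(needed_indices) / len(context_segments)

# Usage: a 10-segment document where the answer draws on 8 segments
# is a high-coverage task (ICR = 0.8), whereas a needle-in-a-haystack
# query touching a single segment would score only 0.1.
segments = [f"paragraph {i}" for i in range(10)]
print(information_coverage_rate(segments, needed_indices={0, 1, 2, 3, 4, 5, 6, 7}))  # 0.8
```

Under this reading, a benchmark dominated by low-coverage tasks rewards simple retrieval rather than comprehension, which is why high-coverage tasks make for a stricter test of long-context ability.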
For medical applications, this research underscores the importance of rigorous testing before deploying LLMs in clinical settings, where missing critical information could compromise patient care.
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage