
Evaluating LLMs in Long-Context Scenarios
New benchmark reveals gaps in how we test model information retention
The ETHIC benchmark provides a rigorous method for evaluating how well large language models genuinely comprehend and use information spread across long texts.
- Introduces the Information Coverage Rate (ICR) to quantify how much of the input context a task requires a model to actually use (a hedged sketch of the idea follows this list)
- Evaluates models across four domains: medicine, law, education, and linguistics
- Reveals that existing evaluation methods, many of which require only a small portion of the provided context, may overestimate model performance on long-context tasks
- Shows substantial performance gaps between standard and specialized models in domain-specific scenarios
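To make the metric concrete, here is a minimal sketch of an ICR-style computation. It is an illustration under assumptions, not the paper's exact procedure: it treats ICR as the fraction of context segments the reference answer depends on, and `needed_indices` stands in for whatever relevance-annotation process the benchmark actually uses.

```python
from typing import List, Set

def information_coverage_rate(context_segments: List[str],
                              needed_indices: Set[int]) -> float:
    """Illustrative ICR: the fraction of context segments whose
    information is required to answer the query.

    Assumption: `needed_indices` comes from human or automated
    annotation of segment relevance; ETHIC's actual procedure
    may differ.
    """
    if not context_segments:
        raise ValueError("context must contain at least one segment")
    return len(needed_indices) / len(context_segments)

# Usage: a 10-segment document where the answer draws on 8 segments
# is a high-coverage task (ICR = 0.8), whereas a needle-in-a-haystack
# query touching a single segment would score only 0.1.
segments = [f"paragraph {i}" for i in range(10)]
print(information_coverage_rate(segments, needed_indices={0, 1, 2, 3, 4, 5, 6, 7}))  # 0.8
```

Under this reading, a benchmark dominated by low-coverage tasks rewards simple retrieval rather than comprehension, which is why high-coverage tasks make for a stricter test of long-context ability.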
For medical applications, this research underscores the importance of rigorous testing before deploying LLMs in clinical settings, where missing critical information could compromise patient care.
ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage