
CURIE: Pushing the Boundaries of Scientific AI
Evaluating LLMs on long scientific contexts across multiple disciplines
CURIE introduces a comprehensive benchmark that tests how well large language models can understand, reason about, and extract information from long scientific documents.
- Spans six scientific disciplines, including biodiversity, proteins, and materials science
- Contains 580 expert-curated problems that require synthesizing information from full-length research papers
- Reveals current limitations of LLMs such as GPT-4o on tasks like protein sequence reconstruction
- Establishes a framework for evaluating AI assistants in realistic scientific workflows, as sketched below
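To make the evaluation setup concrete, here is a minimal sketch of what a CURIE-style harness could look like. This is an illustration, not the benchmark's actual code: the JSONL field names (`paper_text`, `question`, `ground_truth`) and the `ask_model` callable are hypothetical stand-ins for whatever data format and model client you use.

```python
import json
from pathlib import Path
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string match."""
    return " ".join(text.lower().split())


def evaluate(examples_path: Path, ask_model: Callable[[str], str]) -> float:
    """Score a model on CURIE-style long-context questions.

    Assumes a JSONL file where each record holds the full paper text,
    a question, and a ground-truth answer (field names are hypothetical).
    """
    correct, total = 0, 0
    for line in examples_path.read_text().splitlines():
        record = json.loads(line)
        # The whole paper goes into the prompt: CURIE-style tasks require
        # synthesizing information spread across a long document.
        prompt = (
            f"Paper:\n{record['paper_text']}\n\n"
            f"Question: {record['question']}\nAnswer:"
        )
        prediction = ask_model(prompt)
        correct += normalize(prediction) == normalize(record["ground_truth"])
        total += 1
    return correct / total if total else 0.0
```

Any model client can be plugged in as `ask_model`, which keeps the harness model-agnostic. In practice, many scientific tasks would need richer, task-specific scoring (for example, matching extracted values or reconstructed sequences) rather than the exact string match used here.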
For biology researchers, CURIE provides crucial insights into how LLMs handle biodiversity analysis and protein-related tasks, highlighting both the potential and current limitations of AI for scientific discovery.
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning