
CURIE: Pushing the Boundaries of Scientific AI
Evaluating LLMs on long scientific contexts across multiple disciplines
CURIE introduces a comprehensive benchmark that tests how well large language models can understand, reason about, and extract information from long scientific documents.
- Spans six scientific disciplines, including biodiversity, proteins, and materials science
- Contains 580 expert-curated problems that require synthesizing information from full-length research papers
- Reveals current limitations of LLMs such as GPT-4o on tasks like protein sequence reconstruction
- Establishes a framework for evaluating AI assistants in realistic scientific workflows, as sketched below
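To make the evaluation setup concrete, here is a minimal sketch of what a CURIE-style harness could look like. This is an illustration, not the benchmark's actual code: the JSONL field names (`paper_text`, `question`, `ground_truth`) and the `ask_model` callable are hypothetical stand-ins for whatever data format and model client you use.

```python
import json
from pathlib import Path
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient string match."""
    return " ".join(text.lower().split())


def evaluate(examples_path: Path, ask_model: Callable[[str], str]) -> float:
    """Score a model on CURIE-style long-context questions.

    Assumes a JSONL file where each record holds the full paper text,
    a question, and a ground-truth answer (field names are hypothetical).
    """
    correct, total = 0, 0
    for line in examples_path.read_text().splitlines():
        record = json.loads(line)
        # The whole paper goes into the prompt: CURIE-style tasks require
        # synthesizing information spread across a long document.
        prompt = (
            f"Paper:\n{record['paper_text']}\n\n"
            f"Question: {record['question']}\nAnswer:"
        )
        prediction = ask_model(prompt)
        correct += normalize(prediction) == normalize(record["ground_truth"])
        total += 1
    return correct / total if total else 0.0
```

Any model client can be plugged in as `ask_model`, which keeps the harness model-agnostic. In practice, many scientific tasks would need richer, task-specific scoring (for example, matching extracted values or reconstructed sequences) rather than the exact string match used here.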
For biology researchers, CURIE provides crucial insights into how LLMs handle biodiversity analysis and protein-related tasks, highlighting both the potential and current limitations of AI for scientific discovery.
CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning