CURIE: Pushing the Boundaries of Scientific AI

Evaluating LLMs on long scientific contexts across multiple disciplines

CURIE introduces a comprehensive benchmark that tests how well large language models (LLMs) can understand, reason over, and extract information from lengthy scientific content.

  • Spans six scientific disciplines including biology and materials science
  • Contains 580 expert-curated problems requiring synthesis of information across long documents
  • Reveals current limitations in LLMs like GPT-4o for tasks such as protein sequencing
  • Establishes a framework for evaluating AI assistants in realistic scientific workflows

For biology researchers, CURIE offers crucial insight into how LLMs handle biodiversity analysis and protein-related tasks, highlighting both the promise and the current limitations of AI for scientific discovery.

CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
