Benchmarking LLM Summarization Power

A multi-dimensional evaluation across 17 large language models

This research provides a comprehensive evaluation framework for text summarization capabilities across leading commercial and open-source LLMs.

  • Tested 17 models (OpenAI, Google, Anthropic, open-source) on 7 diverse datasets
  • Evaluated at multiple output lengths (50, 100, 150 tokens)
  • Measured factual consistency and semantic similarity with novel metrics (a rough sketch follows this list)
  • Compared performance across specialized domains including medical content
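
The paper's own metric definitions are not reproduced here. As a minimal illustration of one of the two measurement ideas, semantic similarity between a source document and a generated summary can be approximated with sentence-embedding cosine similarity; the embedding model and library below are assumptions for the sketch, not the authors' method.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative only: the embedding model and cosine-similarity measure are
# assumptions for this sketch, not the novel metrics described in the paper.
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(source_text: str, summary: str) -> float:
    """Cosine similarity between embeddings of the source and its summary."""
    embeddings = model.encode([source_text, summary], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# Example usage with toy inputs
score = semantic_similarity(
    "The patient cohort showed reduced mortality with early intervention.",
    "Early intervention lowered mortality in the cohort.",
)
print(f"semantic similarity: {score:.3f}")
```

Factual consistency is typically checked differently (for example, with entailment-based scoring of summary claims against the source), so a single embedding-similarity score like the one above should not be read as a faithfulness measure.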

For medical applications, this research helps identify which LLMs most accurately summarize complex medical literature from PubMed, where factual consistency is critical for healthcare decision-making.

An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models
