
Benchmarking LLM Summarization Power
A multi-dimensional evaluation across 17 large language models
This research provides a comprehensive evaluation framework for text summarization capabilities across leading commercial and open-source LLMs.
- Tested 17 models (OpenAI, Google, Anthropic, open-source) on 7 diverse datasets
- Evaluated at multiple output lengths (50, 100, 150 tokens)
- Measured factual consistency and semantic similarity with novel metrics (a generic scoring sketch follows this list)
- Compared performance across specialized domains including medical content
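The overview does not spell out the metric definitions themselves, so the snippet below is only a minimal sketch of how summary quality is commonly scored: embedding cosine similarity between source and summary, swept over the three output-length budgets. The embedding model name and the `summarize` truncation stub are illustrative assumptions, not the study's actual pipeline or its novel metrics.

```python
# Illustrative sketch only -- not the study's metrics or evaluation harness.
# Assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def semantic_similarity(source: str, summary: str) -> float:
    """Cosine similarity between source and summary embeddings."""
    src_emb, sum_emb = embedder.encode([source, summary], convert_to_tensor=True)
    return util.cos_sim(src_emb, sum_emb).item()


def summarize(text: str, max_tokens: int) -> str:
    """Placeholder truncation baseline standing in for a real LLM call."""
    return " ".join(text.split()[:max_tokens])


if __name__ == "__main__":
    document = "An example article from one of the benchmark datasets ..."
    for budget in (50, 100, 150):  # the three output lengths evaluated
        candidate = summarize(document, max_tokens=budget)
        print(budget, round(semantic_similarity(document, candidate), 3))
```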
For medical applications, this research helps identify which LLMs can most accurately summarize complex medical literature from PubMed, ensuring the factual consistency that is critical for healthcare decision-making.