Understanding LLM Service Failures

This research introduces FAILS, a framework that systematically collects and analyzes failures in large language model (LLM) services, with the goal of improving their reliability.

Creates a first-of-its-kind dataset of LLM service incidents from public sources
Enables automated collection and classification of service disruptions
Provides insights into failure patterns across different LLM providers
Establishes a foundation for better reliability engineering in AI systems
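
As a rough illustration of what automated classification of service disruptions could look like, here is a minimal Python sketch. It is not the FAILS implementation: the keyword rules, the category names, the `Incident` type, and the sample records are all hypothetical and invented for this example. It labels each incident by matching keywords in its title, then counts incidents per provider and category.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical keyword rules, invented for illustration (not from FAILS).
# Matching is by plain substring, so rules are checked in insertion order.
CATEGORY_KEYWORDS = {
    "outage": ("outage", "unavailable"),
    "degraded_performance": ("latency", "slow", "degraded"),
    "elevated_errors": ("error rate", "errors", "failed requests"),
}


@dataclass
class Incident:
    provider: str  # placeholder provider name
    title: str     # short incident title, e.g. from a public status page


def classify(incident: Incident) -> str:
    """Return the first category whose keywords appear in the title."""
    text = incident.title.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"


def summarize(incidents: list[Incident]) -> Counter:
    """Count incidents per (provider, category) pair."""
    return Counter((i.provider, classify(i)) for i in incidents)


# Made-up sample records; real incidents would come from public sources.
SAMPLE = [
    Incident("provider-a", "API unavailable for all users"),
    Incident("provider-a", "Elevated latency on chat completions"),
    Incident("provider-b", "Increased error rate on embeddings endpoint"),
    Incident("provider-b", "Billing dashboard display issue"),
]

if __name__ == "__main__":
    for (provider, category), count in sorted(summarize(SAMPLE).items()):
        print(f"{provider}\t{category}\t{count}")
```

Keyword matching is only a baseline chosen to keep the sketch short. A production pipeline would likely need richer signals, such as incident timelines and affected components, to separate categories reliably.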

For engineering teams, this research shows how LLM services fail in production, which can help them design more robust systems and implement effective mitigation strategies.

FAILS: A Framework for Automated Collection and Analysis of LLM Service Incidents