Rethinking LLM Testing

A new taxonomic approach for testing language model software

This research introduces a structured framework for testing Large Language Model (LLM) based software and multi-agent systems that addresses their non-deterministic nature.

  • Identifies key variation points that impact test correctness for LLM-based systems
  • Demonstrates why traditional testing approaches are insufficient for LLM verification
  • Establishes a taxonomy for test case design informed by both research literature and practical experience
  • Bridges the gap between academic research and engineering practice in LLM testing

For engineering teams, this framework provides critical guidance on developing reliable verification methods for increasingly complex AI systems where simple output comparisons or statistical accuracy metrics no longer suffice.
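To make the contrast with conventional testing concrete, the sketch below shows one way a check can avoid exact output comparison by asserting structural properties across repeated samples. This is an illustrative assumption rather than a method from the paper: `call_llm` is a hypothetical stand-in for the system under test, and the prompt, sample count, and properties are placeholders.

```python
import json


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the LLM-based system under test; wire up a real client here."""
    raise NotImplementedError


def test_summary_satisfies_structural_properties():
    # Exact-match assertions break under non-determinism, so instead we
    # sample several completions and check properties that must hold for all of them.
    prompt = "Summarise the incident report as JSON with keys 'title' and 'severity'."
    for _ in range(5):  # repeated sampling exposes run-to-run variation
        raw = call_llm(prompt)
        data = json.loads(raw)                                 # property 1: output parses as JSON
        assert {"title", "severity"} <= data.keys()            # property 2: required schema fields present
        assert data["severity"] in {"low", "medium", "high"}   # property 3: value stays in the allowed set
```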

Challenges in Testing Large Language Model Based Software: A Faceted Taxonomy
