
Testing LLM Robustness for Software Requirements
Evaluating consistency in NFR-aware code generation
This research introduces RobuNFR, a novel framework for evaluating how consistently large language models handle Non-Functional Requirements (NFRs) when generating code.
- Evaluates LLM robustness across four NFR dimensions: design, readability, reliability, and performance
- Uses three testing methodologies: prompt variation, regression testing, and diversity measurement (a minimal sketch of the prompt-variation idea follows this list)
- Reveals that even leading LLMs produce inconsistent code when users express the same NFRs differently
- Provides a structured approach for developers to assess LLM reliability for enterprise software engineering
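To make the prompt-variation idea concrete, here is a minimal sketch: the same NFR is phrased several ways, each phrasing is sent to a code generator, and the pairwise similarity of the outputs serves as a rough consistency signal. The `generate_code` stub, the example phrasings, and the use of textual similarity as the metric are illustrative assumptions, not the actual RobuNFR implementation or its scoring method.

```python
# Sketch of prompt-variation robustness testing (assumed setup, not RobuNFR's code).
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, List

# Different ways a developer might express the same performance NFR (illustrative).
NFR_PHRASINGS: List[str] = [
    "Sort the list; the function must run in O(n log n) time.",
    "Implement sorting with good performance (avoid quadratic algorithms).",
    "Write an efficient sort suitable for large inputs.",
]

def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real model client."""
    # Placeholder output so the sketch runs end to end.
    return "def solve(xs):\n    return sorted(xs)\n"

def pairwise_consistency(outputs: List[str]) -> float:
    """Mean textual similarity across all pairs of generated solutions (0..1)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def evaluate(phrasings: List[str], gen: Callable[[str], str]) -> float:
    """Generate code for every phrasing of the same NFR and score consistency."""
    outputs = [gen(p) for p in phrasings]
    return pairwise_consistency(outputs)

if __name__ == "__main__":
    score = evaluate(NFR_PHRASINGS, generate_code)
    print(f"Consistency across phrasings: {score:.2f}")
```

In practice the similarity function would be swapped for whatever behavioral or quality metric the framework defines (e.g., test pass rates per NFR dimension); the structure of "vary the phrasing, hold the requirement fixed, compare the outputs" is the part this sketch illustrates.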
This research matters because it addresses a critical gap in software engineering practice: ensuring that AI code generators maintain consistency despite variations in how developers express requirements.