Testing LLM Robustness for Software Requirements

Evaluating consistency in NFR-aware code generation

This research introduces RobuNFR, a novel framework for evaluating how consistently large language models handle Non-Functional Requirements (NFRs) when generating code.

  • Evaluates LLM robustness across four NFR dimensions: design, readability, reliability, and performance
  • Uses three testing methodologies: prompt variation (illustrated in the sketch after this list), regression testing, and diversity measurement
  • Reveals that even leading LLMs produce inconsistent code when users express the same NFRs differently
  • Provides a structured approach for developers to assess LLM reliability for enterprise software engineering
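
To make the prompt-variation idea concrete, the sketch below generates code for several semantically equivalent phrasings of the same NFR and compares the resulting scores. This is a minimal, hypothetical sketch, not the RobuNFR implementation: `generate_code` and `score_nfr` are assumed placeholders for a model call and an NFR-scoring metric that the framework itself would supply.

```python
# Hypothetical sketch of a prompt-variation robustness check; not the
# actual RobuNFR implementation. `generate_code` and `score_nfr` are
# assumed stand-ins for a model call and an NFR metric.

def generate_code(prompt: str) -> str:
    """Call an LLM to produce code for the given prompt (stub)."""
    raise NotImplementedError

def score_nfr(code: str, dimension: str) -> float:
    """Score generated code on one NFR dimension, e.g. 'readability' (stub)."""
    raise NotImplementedError

def prompt_variation_spread(variants: list[str], dimension: str) -> float:
    """Generate code for each rephrasing of the same requirement and
    return the spread (max - min) of NFR scores; a smaller spread
    suggests more consistent, robust behavior."""
    scores = [score_nfr(generate_code(p), dimension) for p in variants]
    return max(scores) - min(scores)

# Three phrasings of the same readability requirement:
variants = [
    "Write a Python function that parses CSV rows; keep it easy to read.",
    "Implement CSV row parsing in Python, prioritizing readability.",
    "Parse CSV rows in Python; the code must be readable and well named.",
]
# spread = prompt_variation_spread(variants, "readability")
```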

This research matters because it addresses a critical gap in software engineering practice: ensuring that AI code generators maintain consistency despite variations in how developers express requirements.

RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation
