Testing LLM Robustness for Software Requirements

Evaluating consistency in NFR-aware code generation

This research introduces RobuNFR, a novel framework for evaluating how consistently large language models handle Non-Functional Requirements (NFRs) when generating code.

  • Evaluates LLM robustness across four NFR dimensions: design, readability, reliability, and performance
  • Uses three testing methodologies: prompt variation (illustrated in the sketch after this list), regression testing, and diversity measurement
  • Reveals that even leading LLMs produce inconsistent code when users express the same NFRs differently
  • Provides a structured approach for developers to assess LLM reliability for enterprise software engineering
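
To make the prompt-variation idea concrete, the sketch below generates code for several semantically equivalent phrasings of the same NFR and compares the resulting scores. This is a minimal, hypothetical sketch, not the RobuNFR implementation: `generate_code` and `score_nfr` are assumed placeholders for a model call and an NFR-scoring metric that the framework itself would supply.

```python
# Hypothetical sketch of a prompt-variation robustness check; not the
# actual RobuNFR implementation. `generate_code` and `score_nfr` are
# assumed stand-ins for a model call and an NFR metric.

def generate_code(prompt: str) -> str:
    """Call an LLM to produce code for the given prompt (stub)."""
    raise NotImplementedError

def score_nfr(code: str, dimension: str) -> float:
    """Score generated code on one NFR dimension, e.g. 'readability' (stub)."""
    raise NotImplementedError

def prompt_variation_spread(variants: list[str], dimension: str) -> float:
    """Generate code for each rephrasing of the same requirement and
    return the spread (max - min) of NFR scores; a smaller spread
    suggests more consistent, robust behavior."""
    scores = [score_nfr(generate_code(p), dimension) for p in variants]
    return max(scores) - min(scores)

# Three phrasings of the same readability requirement:
variants = [
    "Write a Python function that parses CSV rows; keep it easy to read.",
    "Implement CSV row parsing in Python, prioritizing readability.",
    "Parse CSV rows in Python; the code must be readable and well named.",
]
# spread = prompt_variation_spread(variants, "readability")
```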

This research matters because it addresses a critical gap in software engineering practice: ensuring that AI code generators maintain consistency despite variations in how developers express requirements.

RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation
