
Evaluating LLMs in Software Engineering
A systematic framework for rigorous empirical research
New guidelines for conducting empirical studies involving Large Language Models in software engineering research, addressing the current lack of standardized evaluation approaches.
- Establishes methodological standards for LLM-based software engineering research
- Provides a structured framework to ensure validity and reproducibility
- Addresses unique challenges of using LLMs in empirical studies
- Promotes scientific rigor in a rapidly evolving research landscape
This research matters to engineering teams because it enables more reliable evaluation of LLM capabilities, so that decisions about adopting LLM-based tooling rest on trustworthy evidence rather than hype.
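
As a concrete illustration of the kind of reproducibility practice such guidelines call for, the sketch below records the full configuration of an LLM evaluation run (model identifier, decoding parameters, prompt template, seed) alongside each output. The field names and the `query_llm` helper are illustrative assumptions for this post, not APIs or requirements taken from the paper.

```python
"""Minimal sketch of a reproducible LLM evaluation run.

Records the run configuration next to every model output so results
can be traced back to the exact setup that produced them. All names
here (RunConfig, query_llm) are hypothetical, for illustration only.
"""
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class RunConfig:
    model: str            # exact, pinned model identifier used in the study
    temperature: float    # decoding temperature
    seed: int             # seed passed to the provider, if supported
    prompt_template: str  # verbatim prompt template under evaluation


def query_llm(config: RunConfig, task_input: str) -> str:
    """Hypothetical stand-in for the actual LLM provider call."""
    raise NotImplementedError("Replace with the provider call used in your study.")


def run_and_log(config: RunConfig, task_input: str, log_path: str) -> str:
    """Query the model once and append config, input, and output to a JSONL log."""
    output = query_llm(config, task_input)
    record = {
        "timestamp": time.time(),
        "config": asdict(config),
        "input": task_input,
        "output": output,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return output
```

Logging this metadata per run is one simple way to keep results auditable when model versions and decoding behavior change over time.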
Towards Evaluation Guidelines for Empirical Studies involving LLMs