
Evaluating LLMs in Software Engineering
A systematic framework for rigorous empirical research
New guidelines for conducting empirical studies involving Large Language Models in software engineering research, addressing the current lack of standardized evaluation approaches.
- Establishes methodological standards for LLM-based software engineering research
- Provides a structured framework to ensure validity and reproducibility
- Addresses unique challenges of using LLMs in empirical studies
- Promotes scientific rigor in a rapidly evolving research landscape
This research matters to engineering teams because it enables more reliable evaluation of LLM capabilities, so that decisions about adopting LLM-based tooling rest on trustworthy evidence rather than hype.
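
As a concrete illustration of the kind of reproducibility practice such guidelines call for, the sketch below records the full configuration of an LLM evaluation run (model identifier, decoding parameters, prompt template, seed) alongside each output. The field names and the `query_llm` helper are illustrative assumptions for this post, not APIs or requirements taken from the paper.

```python
"""Minimal sketch of a reproducible LLM evaluation run.

Records the run configuration next to every model output so results
can be traced back to the exact setup that produced them. All names
here (RunConfig, query_llm) are hypothetical, for illustration only.
"""
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class RunConfig:
    model: str            # exact, pinned model identifier used in the study
    temperature: float    # decoding temperature
    seed: int             # seed passed to the provider, if supported
    prompt_template: str  # verbatim prompt template under evaluation


def query_llm(config: RunConfig, task_input: str) -> str:
    """Hypothetical stand-in for the actual LLM provider call."""
    raise NotImplementedError("Replace with the provider call used in your study.")


def run_and_log(config: RunConfig, task_input: str, log_path: str) -> str:
    """Query the model once and append config, input, and output to a JSONL log."""
    output = query_llm(config, task_input)
    record = {
        "timestamp": time.time(),
        "config": asdict(config),
        "input": task_input,
        "output": output,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return output
```

Logging this metadata per run is one simple way to keep results auditable when model versions and decoding behavior change over time.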
Towards Evaluation Guidelines for Empirical Studies involving LLMs