
Benchmarking Data Leakage in LLMs
First comprehensive study across 83 software engineering benchmarks
This research investigates the critical issue of data leakage in LLMs applied to software engineering tasks, measuring how often benchmark evaluation data was unintentionally included in model training corpora.
- Evaluated data leakage across 83 software engineering benchmarks
- Created a novel methodology to detect when LLMs have "seen" evaluation data (see the sketch after this list)
- Revealed concerning patterns of data contamination that may invalidate many published results
- Demonstrated how leakage can lead to overestimated performance in code generation and program repair tasks
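The paper's exact detection methodology is not described in this summary. As a rough illustration of the general idea, the sketch below shows one common contamination check: flag benchmark items whose n-grams heavily reappear in a sample of the training corpus. All names (`flag_contaminated`, `overlap_ratio`, the threshold value) are hypothetical and not taken from the paper.

```python
# Minimal sketch of an n-gram-overlap contamination check.
# NOTE: a generic illustration, not the paper's actual methodology;
# function names, parameters, and the 0.5 threshold are assumptions.

from typing import Iterable, List, Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_ngrams: Set[tuple], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus sample."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


def flag_contaminated(benchmark_items: Iterable[str],
                      training_docs: Iterable[str],
                      n: int = 8,
                      threshold: float = 0.5) -> List[int]:
    """Indices of benchmark items whose n-gram overlap with the
    (sampled) training corpus exceeds `threshold`."""
    corpus_ngrams: Set[tuple] = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if overlap_ratio(item, corpus_ngrams, n) > threshold]


if __name__ == "__main__":
    benchmark = ["def add(a, b): return a + b  # classic toy task",
                 "implement a red-black tree deletion routine in C"]
    training_sample = ["def add(a, b): return a + b  # classic toy task"]
    # Only the first benchmark item appears verbatim in the training sample,
    # so it is flagged as likely leaked.
    print(flag_contaminated(benchmark, training_sample, n=4, threshold=0.5))
```

In practice, overlap checks like this are usually combined with other signals such as memorization probes or perplexity-based membership inference, since paraphrased or lightly edited benchmark items can evade exact n-gram matching.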
For engineering teams, this work provides crucial guidance for selecting reliable benchmarks and properly evaluating LLM performance on software tasks, ensuring that apparent model capabilities reflect genuine ability rather than memorization.