
Benchmarking Data Leakage in LLMs
First comprehensive study across 83 software engineering benchmarks
This research investigates the critical issue of data leakage in LLMs applied to software engineering tasks, measuring how often benchmark evaluation data was unintentionally included in model training corpora.
- Evaluated data leakage across 83 software engineering benchmarks
- Created a novel methodology to detect when LLMs have "seen" evaluation data (see the sketch after this list)
- Revealed concerning patterns of data contamination that may invalidate many published results
- Demonstrated how leakage can lead to overestimated performance in code generation and program repair tasks
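The paper's exact detection methodology is not described in this summary. As a rough illustration of the general idea, the sketch below shows one common contamination check: flag benchmark items whose n-grams heavily reappear in a sample of the training corpus. All names (`flag_contaminated`, `overlap_ratio`, the threshold value) are hypothetical and not taken from the paper.

```python
# Minimal sketch of an n-gram-overlap contamination check.
# NOTE: a generic illustration, not the paper's actual methodology;
# function names, parameters, and the 0.5 threshold are assumptions.

from typing import Iterable, List, Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_ngrams: Set[tuple], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the corpus sample."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)


def flag_contaminated(benchmark_items: Iterable[str],
                      training_docs: Iterable[str],
                      n: int = 8,
                      threshold: float = 0.5) -> List[int]:
    """Indices of benchmark items whose n-gram overlap with the
    (sampled) training corpus exceeds `threshold`."""
    corpus_ngrams: Set[tuple] = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if overlap_ratio(item, corpus_ngrams, n) > threshold]


if __name__ == "__main__":
    benchmark = ["def add(a, b): return a + b  # classic toy task",
                 "implement a red-black tree deletion routine in C"]
    training_sample = ["def add(a, b): return a + b  # classic toy task"]
    # Only the first benchmark item appears verbatim in the training sample,
    # so it is flagged as likely leaked.
    print(flag_contaminated(benchmark, training_sample, n=4, threshold=0.5))
```

In practice, overlap checks like this are usually combined with other signals such as memorization probes or perplexity-based membership inference, since paraphrased or lightly edited benchmark items can evade exact n-gram matching.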
For engineering teams, this work provides crucial guidance for selecting reliable benchmarks and properly evaluating LLM performance on software tasks, ensuring that apparent model capabilities reflect genuine ability rather than memorization.