Benchmarking Data Leakage in LLMs

First comprehensive study across 83 software engineering benchmarks

This research investigates the critical issue of data leakage in LLMs applied to software engineering tasks, measuring how often benchmark evaluation data was unintentionally included in model training data.

  • Evaluated data leakage across 83 software engineering benchmarks
  • Created a novel methodology to detect when LLMs have "seen" evaluation data (a simplified illustration of the general idea follows this list)
  • Revealed concerning patterns of data contamination that may invalidate many published results
  • Demonstrated how leakage can lead to overestimated performance in code generation and program repair tasks
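
The paper's detection methodology is not reproduced here. As a simplified illustration of the general idea behind leakage checks, the sketch below flags a benchmark sample as potentially "seen" when long n-grams from it also occur in a candidate training corpus; the function names, the n-gram length, and the overlap threshold are assumptions chosen for illustration, not the paper's parameters.

    # Hedged sketch: n-gram overlap as a rough proxy for data leakage.
    # Illustration only; this is not the LessLeak-Bench methodology.
    from typing import Iterable, Set, Tuple

    def ngrams(tokens: list, n: int) -> Set[Tuple[str, ...]]:
        """Return the set of n-grams (as tuples) in a token sequence."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_ratio(sample: str, corpus_ngrams: Set[Tuple[str, ...]], n: int = 13) -> float:
        """Fraction of the sample's n-grams that also appear in the training corpus."""
        sample_ngrams = ngrams(sample.split(), n)
        if not sample_ngrams:
            return 0.0
        return len(sample_ngrams & corpus_ngrams) / len(sample_ngrams)

    def flag_leaked(benchmark: Iterable[str], corpus_docs: Iterable[str],
                    n: int = 13, threshold: float = 0.5) -> list:
        """Return indices of benchmark samples whose overlap exceeds the threshold."""
        corpus_ngrams: Set[Tuple[str, ...]] = set()
        for doc in corpus_docs:
            corpus_ngrams |= ngrams(doc.split(), n)
        return [i for i, s in enumerate(benchmark)
                if overlap_ratio(s, corpus_ngrams, n) >= threshold]

Overlap checks of this kind only catch verbatim or near-verbatim contamination; paraphrased or semantically equivalent leakage requires stronger detection techniques.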

For engineering teams, this work provides crucial guidance for selecting reliable benchmarks and properly evaluating LLM performance on software tasks, helping ensure that apparent model capabilities reflect genuine ability rather than memorization of test data.

LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks
