Benchmark Contamination in LLMs

Investigating data leakage in bug detection benchmarks

This study examines whether Large Language Models are genuinely solving bug detection tasks or simply memorizing benchmark datasets.
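To make the distinction concrete, one common way to probe for memorization is a verbatim-completion check: show the model the first part of a benchmark example and measure how much of the held-out remainder it reproduces word for word. The sketch below is only illustrative and is not the study's actual methodology; the `complete_fn`, `memorization_score`, and `toy_model` names are hypothetical stand-ins.

```python
# Minimal sketch of a verbatim-completion probe for benchmark memorization.
# `complete_fn` stands in for any LLM completion call (hypothetical here);
# the paper's actual methodology may differ -- this only illustrates the idea.

def ngrams(tokens, n):
    """Return the set of n-grams over a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorization_score(complete_fn, snippet, split=0.5, n=5):
    """Show the model the first part of a benchmark snippet and measure how
    much of the held-out remainder it reproduces verbatim (n-gram overlap).
    A score near 1.0 suggests the snippet was likely seen during training."""
    tokens = snippet.split()
    cut = max(1, int(len(tokens) * split))
    prefix, suffix = tokens[:cut], tokens[cut:]

    continuation = complete_fn(" ".join(prefix)).split()

    target = ngrams(suffix, n)
    produced = ngrams(continuation, n)
    if not target:
        return 0.0
    return len(target & produced) / len(target)

if __name__ == "__main__":
    # Hypothetical stand-in model that happens to "know" the snippet verbatim.
    benchmark_bug = "if (idx <= arr.length) { return arr[idx]; } // off-by-one bounds check"

    def toy_model(prompt):
        return benchmark_bug[len(prompt):]

    print(f"overlap: {memorization_score(toy_model, benchmark_bug, n=3):.2f}")
```

A score close to 1.0, as in this toy example, would indicate verbatim recall rather than genuine bug detection; real contamination studies typically aggregate such signals over many benchmark examples.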

Key Findings:

  • LLMs show concerning levels of data contamination on common bug benchmarks
  • Even advanced models show evidence of memorizing specific bugs rather than genuinely understanding them
  • The research reveals significant reliability issues in how we evaluate LLM performance on software engineering tasks

This research matters because it challenges the validity of current evaluation methods for LLMs in code-related tasks, highlighting the need for more robust benchmarking approaches that prevent data leakage.

Source paper: Are Large Language Models Memorizing Bug Benchmarks?