
Cleaning Up Vulnerability Detection in Code
Using LLMs to spot real vulnerabilities in software commits
CleanVul introduces a novel approach to accurately identify vulnerable code changes using large language models as intelligent filters.
- Filters out the 40-75% noise present in vulnerability datasets, keeping only high-quality vulnerability examples
- Employs LLM-based heuristics to distinguish true vulnerability-fixing code from regular changes
- Creates a cleaner dataset that improves training of vulnerability detection models
- Demonstrates practical applications for software security teams to prioritize actual threats
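The filtering idea in the points above can be sketched as follows. This is a minimal illustration, not CleanVul's actual pipeline: the `FunctionChange` schema, the 0 to 4 score scale, the default threshold, and the `keyword_scorer` function are all assumptions made for the example. In a real setup the scorer would wrap an LLM prompt that rates how likely a change is to fix a vulnerability; here a toy keyword check stands in so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FunctionChange:
    """One function-level change from a commit (hypothetical schema)."""
    commit_id: str
    before: str
    after: str


# A scorer rates a change from 0 (unrelated) to 4 (clearly a vulnerability fix).
# The scale is an assumption for this sketch; an LLM call would sit behind it.
Scorer = Callable[[FunctionChange], int]


def filter_vulnerability_fixes(
    changes: Iterable[FunctionChange],
    score: Scorer,
    threshold: int = 3,
) -> List[FunctionChange]:
    """Keep only the changes whose score meets the threshold."""
    return [change for change in changes if score(change) >= threshold]


def keyword_scorer(change: FunctionChange) -> int:
    """Toy stand-in for an LLM judgment; NOT the heuristic used by CleanVul."""
    security_hints = ("strncpy", "bounds", "sanitize")
    # Score high only when a hint appears in the new code but not the old code.
    for hint in security_hints:
        if hint in change.after and hint not in change.before:
            return 4
    return 0


if __name__ == "__main__":
    changes = [
        FunctionChange("a1", "strcpy(dst, src);", "strncpy(dst, src, sizeof(dst) - 1);"),
        FunctionChange("b2", "int count = 0;", "int total = 0;"),
    ]
    kept = filter_vulnerability_fixes(changes, keyword_scorer)
    print([c.commit_id for c in kept])  # prints ['a1']
```

Passing the scorer in as a callable keeps the filtering logic separate from the judgment source, so the keyword check could be swapped for a model-backed scorer without changing the filter.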
This research addresses a critical challenge in cybersecurity: noisy vulnerability datasets lead to poor detection models. By providing a methodology for creating high-quality datasets, it enables security teams to build more reliable automated vulnerability detection systems.
CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics