Cleaning Up Vulnerability Detection in Code

Using LLMs to spot real vulnerabilities in software commits

CleanVul introduces a novel approach to accurately identify vulnerable code changes using large language models as intelligent filters.

  • Targets the 40-75% noise found in existing vulnerability datasets, keeping only high-quality vulnerability examples
  • Employs LLM-based heuristics to distinguish true vulnerability-fixing code from regular changes
  • Creates a cleaner dataset that improves training of vulnerability detection models
  • Demonstrates practical applications for software security teams to prioritize actual threats
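
The filtering idea in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CleanVul's actual pipeline: the `FunctionChange` record, the 0 to 4 score range, the threshold of 3, and the `keyword_scorer` stand-in are all hypothetical. In a real setup the scorer would wrap an LLM call that judges whether a changed function fixes a vulnerability.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class FunctionChange:
    """One function modified by a commit (hypothetical record layout)."""
    commit_id: str
    name: str
    before: str  # function body before the commit
    after: str   # function body after the commit


# A scorer returns a confidence from 0 to 4 that a change fixes a vulnerability.
# The scale is an assumption made for this sketch.
Scorer = Callable[[FunctionChange], int]


def filter_changes(
    changes: Iterable[FunctionChange], score: Scorer, threshold: int = 3
) -> List[FunctionChange]:
    """Keep only the changes whose score meets or exceeds the threshold."""
    return [change for change in changes if score(change) >= threshold]


def keyword_scorer(change: FunctionChange) -> int:
    """Toy stand-in for an LLM judgment; it only looks for newly added hints."""
    added = change.after.lower()
    removed = change.before.lower()
    hints = ("strncpy", "bounds", "sanitize")
    newly_added = any(h in added and h not in removed for h in hints)
    return 4 if newly_added else 0


changes = [
    FunctionChange("abc123", "copy_name",
                   "strcpy(dst, src);",
                   "strncpy(dst, src, sizeof(dst) - 1);"),
    FunctionChange("abc123", "log_banner",
                   'puts("v1");',
                   'puts("v2");'),
]

kept = filter_changes(changes, keyword_scorer)
print([change.name for change in kept])
```

Passing the scorer in as a plain callable keeps the filter independent of any particular model or prompt, so the toy heuristic can be swapped for an LLM-backed function without touching `filter_changes`.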

This research addresses a critical challenge in cybersecurity: noisy vulnerability datasets lead to poor detection models. By providing a methodology for creating high-quality datasets, it enables security teams to build more reliable automated vulnerability detection systems.

CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics
