
LLMs as Security Guards
Assessing AI's effectiveness in multi-language vulnerability detection
This research provides the first comprehensive benchmark of large language models (LLMs) for detecting software vulnerabilities across multiple programming languages.
- LLMs can effectively detect vulnerabilities in 7 programming languages, with GPT-4 achieving the best performance
- A combined approach using chain-of-thought reasoning and multiple programming contexts significantly enhances detection accuracy (see the prompt sketch after this list)
- Performance varies notably across different vulnerability types and programming languages
- Models fine-tuned on one language transfer zero-shot to other languages, demonstrating strong cross-language generalization
These findings highlight LLMs' potential to transform software security practices by providing automated, cross-language vulnerability detection that integrates into development workflows.
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection