
LLMs as Security Guards
Assessing AI's effectiveness in multi-language vulnerability detection
This research provides the first comprehensive benchmark of large language models (LLMs) for detecting software vulnerabilities across multiple programming languages.
- LLMs can effectively detect vulnerabilities in 7 programming languages, with GPT-4 achieving the best performance
- A combined approach using chain-of-thought reasoning and multiple programming contexts significantly enhances detection accuracy (see the prompt sketch after this list)
- Performance varies notably across different vulnerability types and programming languages
- Models fine-tuned on one language transfer zero-shot to other languages, demonstrating strong cross-language generalization
These findings highlight LLMs' potential to transform software security practices by providing automated, cross-language vulnerability detection that integrates into development workflows.
Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection