
AI Security Threat Assessment
Evaluating LLM Agents' Ability to Exploit Web Vulnerabilities
CVE-Bench introduces the first comprehensive benchmark for testing LLM agents' ability to exploit real-world web application vulnerabilities, with significant implications for web application security.
- Evaluates AI agents against 13 real-world CVEs across various vulnerability types
- Demonstrates that advanced models like GPT-4 can successfully exploit 66% of vulnerabilities
- Reveals that even when exploitation fails, LLM agents often produce partially correct attack strategies
- Identifies key factors affecting exploitation success: model capabilities, prompt engineering, and tool integration (a minimal sketch follows this list)
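The factors named in that last bullet can be pictured as knobs on a small evaluation harness. What follows is a minimal, self-contained Python sketch, not CVE-Bench's actual interface; every name in it (AgentConfig, Target, run_episode, the dummy agent and sandbox) is hypothetical and stands in for real model calls and sandboxed vulnerable targets.

```python
# Hypothetical sketch only: the names below are illustrative, not CVE-Bench's API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AgentConfig:
    model: str                                       # model capability, e.g. "gpt-4"
    system_prompt: str                               # prompt engineering
    tools: List[str] = field(default_factory=list)   # tool integration, e.g. "http", "sqlmap"

@dataclass
class Target:
    cve_id: str
    base_url: str                                    # sandboxed app reproducing the CVE
    is_exploited: Callable[[], bool]                 # oracle: did the attack succeed?

def run_episode(agent_step: Callable[[AgentConfig, str], str],
                sandbox_exec: Callable[[str], str],
                cfg: AgentConfig,
                target: Target,
                max_steps: int = 10) -> bool:
    """Drive the agent for a bounded number of steps; stop early on success."""
    observation = f"target={target.base_url} cve={target.cve_id}"
    for _ in range(max_steps):
        action = agent_step(cfg, observation)        # the LLM proposes the next command
        observation = sandbox_exec(action)           # run it in an isolated environment
        if target.is_exploited():
            return True
    return False

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end without a real model or sandbox.
    dummy_agent = lambda cfg, obs: "curl -s http://localhost:8080/login"
    dummy_sandbox = lambda action: f"ran: {action}"
    targets = [
        Target("CVE-0000-0001", "http://localhost:8080", lambda: False),
        Target("CVE-0000-0002", "http://localhost:8081", lambda: True),
    ]
    cfg = AgentConfig(model="gpt-4",
                      system_prompt="You are a web penetration tester.",
                      tools=["http", "sqlmap"])
    results = [run_episode(dummy_agent, dummy_sandbox, cfg, t) for t in targets]
    print(f"exploited {sum(results)}/{len(results)} targets")
```

In a harness along these lines, varying cfg.model, cfg.system_prompt, and cfg.tools is how the three success factors above would be compared in isolation.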
This research provides security professionals with critical insight for understanding and mitigating emerging AI-powered threats to web applications.
Source: CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities