
AI Security Threat Assessment
Evaluating LLM Agents' Ability to Exploit Web Vulnerabilities
CVE-Bench introduces the first comprehensive benchmark for testing LLM agents' ability to exploit real-world web application vulnerabilities, with significant implications for web application security.
- Evaluates AI agents against 13 real-world CVEs across various vulnerability types
- Demonstrates that advanced models like GPT-4 can successfully exploit 66% of vulnerabilities
- Reveals that even when exploitation fails, LLM agents often produce partially correct attack strategies
- Identifies key factors affecting exploitation success: model capabilities, prompt engineering, and tool integration (a minimal sketch follows this list)
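The factors named in that last bullet can be pictured as knobs on a small evaluation harness. What follows is a minimal, self-contained Python sketch, not CVE-Bench's actual interface; every name in it (AgentConfig, Target, run_episode, the dummy agent and sandbox) is hypothetical and stands in for real model calls and sandboxed vulnerable targets.

```python
# Hypothetical sketch only: the names below are illustrative, not CVE-Bench's API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AgentConfig:
    model: str                                       # model capability, e.g. "gpt-4"
    system_prompt: str                               # prompt engineering
    tools: List[str] = field(default_factory=list)   # tool integration, e.g. "http", "sqlmap"

@dataclass
class Target:
    cve_id: str
    base_url: str                                    # sandboxed app reproducing the CVE
    is_exploited: Callable[[], bool]                 # oracle: did the attack succeed?

def run_episode(agent_step: Callable[[AgentConfig, str], str],
                sandbox_exec: Callable[[str], str],
                cfg: AgentConfig,
                target: Target,
                max_steps: int = 10) -> bool:
    """Drive the agent for a bounded number of steps; stop early on success."""
    observation = f"target={target.base_url} cve={target.cve_id}"
    for _ in range(max_steps):
        action = agent_step(cfg, observation)        # the LLM proposes the next command
        observation = sandbox_exec(action)           # run it in an isolated environment
        if target.is_exploited():
            return True
    return False

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end without a real model or sandbox.
    dummy_agent = lambda cfg, obs: "curl -s http://localhost:8080/login"
    dummy_sandbox = lambda action: f"ran: {action}"
    targets = [
        Target("CVE-0000-0001", "http://localhost:8080", lambda: False),
        Target("CVE-0000-0002", "http://localhost:8081", lambda: True),
    ]
    cfg = AgentConfig(model="gpt-4",
                      system_prompt="You are a web penetration tester.",
                      tools=["http", "sqlmap"])
    results = [run_episode(dummy_agent, dummy_sandbox, cfg, t) for t in targets]
    print(f"exploited {sum(results)}/{len(results)} targets")
```

In a harness along these lines, varying cfg.model, cfg.system_prompt, and cfg.tools is how the three success factors above would be compared in isolation.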
This research provides security professionals with critical insight for understanding and mitigating emerging AI-powered threats to web applications.
Source: CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities