AI agents are increasingly seen as a way to bolster the capabilities of cybersecurity teams, but which does the best job? To find out, Wiz has developed a benchmark suite of 257 real-world challenges spanning five offensive domains: zero-day discovery, CVE (Common Vulnerabilities and Exposures) detection, API security, web security, and cloud security.
Wiz tests different combinations of AI agents and their underlying AI models against the suite to see which scores highest in each of the five categories. Scoring is deterministic and programmatic: multi-dimensional rubrics for zero-day discovery and CVE detection, endpoint-and-severity matching for API security, and flag capture for the web and cloud challenges.
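Wiz has not published its scoring harness, but the last two categories map naturally onto simple deterministic checks. The sketch below is purely illustrative, assuming an exact-match flag for web and cloud challenges and a set of expected (endpoint, severity) pairs for API security; all function names are hypothetical.

```python
def score_flag_capture(submitted_flag: str, expected_flag: str) -> float:
    """Flag capture: 1.0 for an exact flag match, 0.0 otherwise."""
    return 1.0 if submitted_flag.strip() == expected_flag else 0.0

def score_api_findings(reported: set[tuple[str, str]],
                       expected: set[tuple[str, str]]) -> float:
    """Endpoint-and-severity matching: the fraction of expected
    (endpoint, severity) pairs the agent reported correctly."""
    if not expected:
        return 0.0
    return len(reported & expected) / len(expected)

# Example: the agent finds one of two expected issues, scoring 0.5.
# score_api_findings({("/admin", "high")}, {("/admin", "high"), ("/debug", "medium")})
```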
The benchmark tests run inside isolated Docker containers with sufficient resources and no per-challenge timeouts, so scores reflect capability rather than throttling. Each agent uses its native tools and execution model out of the box and gets three attempts at every challenge, with results averaged.
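As a rough sketch of that three-attempt protocol, the snippet below runs an agent repeatedly on one challenge and reports the mean score; the run_agent_in_container callable is a hypothetical stand-in for launching the agent inside its isolated Docker container, not Wiz's actual harness.

```python
from statistics import mean
from typing import Callable

ATTEMPTS = 3  # each agent gets three attempts per challenge

def benchmark_challenge(run_agent_in_container: Callable[[str], float],
                        challenge_id: str) -> float:
    """Run the agent three times on one challenge and return the mean score."""
    scores = [run_agent_in_container(challenge_id) for _ in range(ATTEMPTS)]
    return mean(scores)
```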
In the blog post announcing the Cyber model arena benchmarks, Wiz is coy about the results. Coming out on top is Claude Code running on Claude Opus 4.6, and Wiz, soon to be a subsidiary of Google, may not be too keen on publicizing that. However, Claude's lead is narrow and circumstances can change quickly, and at least Gemini 3 Pro is in second place.