Cybench

A benchmark for evaluating the cybersecurity capabilities and risks of language models.

Cybench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. We add subtasks, which break down a task into intermediary steps for more gradated evaluation, to these tasks.

There's an all-new, real-world BountyBench that evaluates offensive and defensive cybersecurity agents on vulnerability detection, exploitation, and patching with dollar impact. Check it out here.

Leaderboard

* Results for HAL leaderboard evaluation.

** Results for Claude 4.5 Sonnet, Claude 4.1 Opus, Claude 4 Opus, and Claude 4 Sonnet are from the Claude 4.5 Sonnet System Card on a subset of 37 problems. Percentages are estimates based on probability of success on 1 trial.

† The scores for OpenAI o3-mini and OpenAI o1-mini are inflated because HAL likely ran on a fork of the Inspect framework that leaked the answer to a task that both models completed successfully. Their Unguided % Solved scores have been adjusted downward by 2.5% (to 22.5% and 10% respectively) and their FSTs have been updated to reflect their most difficult tasks solved excluding the leaked task (42 min for o3-mini and 11 min for o1-mini).

Unguided % Solved: Success rate without subtask guidance.
Subtask-Guided % Solved: Success rate with subtask guidance.
Subtasks % Solved: Percentage of subtasks solved per task, macro-averaged across the tasks.
Most Difficult Task Solved (First Solve Time by Humans): The highest first solve time of successfully solved tasks. First solve time is the time it takes for the first team to solve a given challenge in a CTF competition.

About

Each task includes a task description, starter files, and an evaluator. A task can also have subtasks, each with an associated question and answer which are scored sequentially for incremental progress. The environment (S) consists of the Kali Linux container containing any task-specific local files and any task server(s) instantiated by remote files. The agent can directly interact through bash commands with the local files and/or indirectly interact through network calls with the remote files. The agent provides a response (R), which contains an action (A), which yields an observation (O) that is added to the agent's memory (M). Later, the agent can submit its answer, which the evaluator will compare against the answer key.

Ethics Statement

Agents for offensive cybersecurity are dual use, both for white hat actors to do penetration testing and improve system security and for black hat actors to mount attacks and do other misdeeds. We have chosen to release our code publicly along with all the details of our runs because our testing did not reveal significant risks, and we believe that releasing code publicly will do more to benefit security than cause harm. Releasing our framework can significantly mitigate risks of new LMs and agents. The framework can be used to track the progress of LMs for penetration testing, and can help other researchers evaluate any risks relating to their work. For a more detailed ethics statement explaining our decision to release our framework, please see Section Ethics Statement in the paper.

Impact

The US AISI and UK AISI leveraged Cybench as the only open source cybersecurity benchmark on their Joint Pre-Deployment Test on Anthropic's Claude 3.5 Sonnet and Joint Pre-Deployment Test on OpenAI o1.
The UK AISI has incorporated Cybench in its Inspect Evals framework.
Anthropic featured Cybench results in its Claude 3.7 Sonnet System Card, Claude 4 System Card, Claude Opus 4.1 System Card, Claude Sonnet 4.5 System Card, and Claude Haiku 4.5 System Card.
Amazon featured Cybench in its evaluation suite for the Amazon Nova Premier Model Card.
xAI featured Cybench in its xAI Risk Management Framework.
OWASP leveraged Cybench as only benchmark for its LLM Exploit Generation Whitepaper.
The Center for AI Safety selected Cybench as a First Prize winning benchmark in the SafeBench competition.
The Japan AISI and the Republic of Korea AISI leveraged Cybench, providing Japanese and Korean translations to evaluate effect of language on the benchmark.

If you rely on Cybench and artifacts, we request that you cite to the underlying paper.

@inproceedings{

zhang2025cybench,

title = {Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models},

author = {Andy K Zhang and Neil Perry and Riya Dulepet and Joey Ji and Celeste Menders and Justin W Lin and Eliot Jones and Gashon Hussein and Samantha Liu and Donovan Julian Jasper and Pura Peetathawatchai and Ari Glenn and Vikram Sivashankar and Daniel Zamoshchin and Leo Glikbarg and Derek Askaryar and Haoxiang Yang and Aolin Zhang and Rishi Alluri and Nathan Tran and Rinnara Sangpisit and Kenny O Oseleononmen and Dan Boneh and Daniel E. Ho and Percy Liang},

booktitle = {The Thirteenth International Conference on Learning Representations},

year = {2025},

url = {https://openreview.net/forum?id=tc90LV0yRL},

}