A benchmark for evaluating the cybersecurity capabilities and risks of language models.

Cybench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. To these tasks, we add subtasks, which break a task down into intermediary steps for more gradated evaluation.

Leaderboard

  • Unguided % Solved: Success rate without subtask guidance.
  • Subtask-Guided % Solved: Success rate with subtask guidance.
  • Subtasks % Solved: Percentage of subtasks solved per task, macro-averaged across the tasks.
  • Most Difficult Task Solved (First Solve Time by Humans): The longest first solve time among the tasks the agent solved, where first solve time is how long it took the first team to solve a given challenge in the original CTF competition.
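As a rough illustration of how these metrics aggregate per-task results, here is a minimal sketch. The record fields (`unguided_solved`, `subtasks_solved`, `first_solve_minutes`, etc.) are hypothetical and do not reflect the repository's actual result schema.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    unguided_solved: bool        # solved without subtask guidance
    guided_solved: bool          # solved with subtask guidance
    subtasks_solved: int         # number of subtasks answered correctly
    subtasks_total: int          # number of subtasks in the task
    first_solve_minutes: float   # first solve time by a human team, in minutes

def leaderboard_metrics(results: list[TaskResult]) -> dict[str, float]:
    n = len(results)
    unguided_pct = 100 * sum(r.unguided_solved for r in results) / n
    guided_pct = 100 * sum(r.guided_solved for r in results) / n
    # Macro-average: compute each task's subtask solve rate, then average those rates.
    subtask_pct = 100 * sum(r.subtasks_solved / r.subtasks_total for r in results) / n
    # Hardest solved task: longest human first solve time among solved tasks
    # (taken over unguided solves here as an assumption).
    hardest = max((r.first_solve_minutes for r in results if r.unguided_solved), default=0.0)
    return {
        "unguided_%_solved": unguided_pct,
        "subtask_guided_%_solved": guided_pct,
        "subtasks_%_solved": subtask_pct,
        "most_difficult_task_solved_minutes": hardest,
    }
```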

About

Each task includes a task description, starter files, and an evaluator. A task can also have subtasks, each with an associated question and answer; these are scored sequentially to measure incremental progress. The environment (S) consists of a Kali Linux container holding any task-specific local files, plus any task server(s) instantiated from remote files. The agent can interact directly with the local files through bash commands and/or indirectly with the remote files through network calls. The agent provides a response (R), which contains an action (A); executing the action yields an observation (O) that is added to the agent's memory (M). Eventually, the agent can submit its answer, which the evaluator compares against the answer key.
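The response/action/observation loop can be summarized in a short sketch. This is only a schematic illustration, not Cybench's actual agent implementation; the `Response` class and the `agent`, `environment`, and `evaluator` objects below are assumed placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Response:
    action: Optional[str] = None   # A: e.g., a bash command to run in the Kali container
    answer: Optional[str] = None   # the answer the agent chooses to submit, if any

def run_task(agent, environment, evaluator, task_description: str, max_iterations: int = 15):
    memory = [task_description]                    # M: accumulated context
    for _ in range(max_iterations):
        response: Response = agent.generate(memory)          # R: model response this turn
        if response.answer is not None:
            # The evaluator compares the submitted answer against the answer key.
            return evaluator.score(response.answer)
        observation = environment.execute(response.action)   # O: command or network output
        memory.append((response.action, observation))         # add the observation to memory
    return evaluator.score(None)  # no answer submitted within the iteration budget
```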

Categories

For task selection, we targeted 6 categories commonly found in CTF competitions.

Ethics Statement

Agents for offensive cybersecurity are dual use: white hat actors can use them for penetration testing and to improve system security, while black hat actors can use them to mount attacks and cause other harm. We have chosen to release our code publicly, along with all the details of our runs, because our testing did not reveal significant risks, and we believe that releasing the code publicly will do more to benefit security than cause harm. Releasing our framework can significantly mitigate the risks of new LMs and agents: the framework can be used to track the progress of LMs on penetration testing, and can help other researchers evaluate any risks relating to their work. For a more detailed ethics statement explaining our decision to release our framework, please see the Ethics Statement section of the paper.

Cite:

@misc{zhang2024cybenchframeworkevaluatingcybersecurity,
title = {Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models},
author = {Andy K. Zhang and Neil Perry and Riya Dulepet and Joey Ji and Justin W. Lin and Eliot Jones and Celeste Menders and Gashon Hussein and Samantha Liu and Donovan Jasper and Pura Peetathawatchai and Ari Glenn and Vikram Sivashankar and Daniel Zamoshchin and Leo Glikbarg and Derek Askaryar and Mike Yang and Teddy Zhang and Rishi Alluri and Nathan Tran and Rinnara Sangpisit and Polycarpos Yiorkadjis and Kenny Osele and Gautham Raghupathi and Dan Boneh and Daniel E. Ho and Percy Liang},
year = {2024},
eprint = {2408.08926},
archivePrefix = {arXiv},
primaryClass = {cs.CR},
url = {https://arxiv.org/abs/2408.08926},
}