A version-control benchmark for coding agents. Live results: vcbench.dev
Coding agents do a growing share of version-control work. This benchmark measures how Claude Code and Codex handle five common version-control tasks with three tools: plain git, Jujutsu (jj+skill), and GitButler (but+skill).
This is not a coding benchmark. The file changes already exist before the agent starts; the agent's job is to produce the right Git-visible state — commit boundaries, branch topology, what stays uncommitted, protected history. Each tool is scored on reliability, speed, and efficiency, judged on the resulting Git history rather than the commands used to produce it.
The benchmark is maintained by GitButler, one of the three tools measured; the grader is deterministic and all data and task definitions are public in this repo.
Full matrix from 2026-07-03: 5 scenarios x 3 tools x 2 agents, seven runs per cell (k=7), 210 graded runs, 193 passed. GitButler passed all 70 of its runs while cutting mean wall time by roughly 68% (Codex) and 61% (Claude) versus plain git. All 17 grader failures were Claude runs on git or Jujutsu, concentrated on split-commit.
Each cell shows pass rate, mean wall time, and mean version-control commands per run. Bold marks the fastest tool that passed every run of that scenario.
Codex:
| Scenario | git | Jujutsu | GitButler |
|---|---|---|---|
| Selective commit | 7/7 · 67.8s · 19 cmds | 7/7 · 99.2s · 20 cmds | **7/7 · 30.8s · 2 cmds** |
| Multi-amend | 7/7 · 174.4s · 44 cmds | 7/7 · 208.5s · 25 cmds | **7/7 · 36.5s · 6 cmds** |
| Split commit | 7/7 · 116.2s · 30 cmds | 7/7 · 185.9s · 37 cmds | **7/7 · 33.0s · 6 cmds** |
| Reorder commits | 7/7 · 54.4s · 11 cmds | 7/7 · 58.4s · 11 cmds | **7/7 · 20.6s · 2 cmds** |
| Squash commits | 7/7 · 34.1s · 11 cmds | 7/7 · 43.3s · 11 cmds | **7/7 · 24.3s · 3 cmds** |
| All scenarios | 35/35 · 89.4s · 23 cmds | 35/35 · 119.0s · 21 cmds | 35/35 · 29.0s · 4 cmds |
Claude:
| Scenario | git | Jujutsu | GitButler |
|---|---|---|---|
| Selective commit | 6/7 · 169.9s · 18 cmds | 5/7 · 172.3s · 21 cmds | **7/7 · 52.3s · 4 cmds** |
| Multi-amend | 6/7 · 598.1s · 58 cmds | 6/7 · 585.5s · 36 cmds | **7/7 · 97.5s · 9 cmds** |
| Split commit | 2/7 · 294.4s · 25 cmds | 1/7 · 456.4s · 43 cmds | **7/7 · 157.5s · 17 cmds** |
| Reorder commits | **7/7 · 68.0s · 6 cmds** | 7/7 · 91.3s · 14 cmds | 7/7 · 97.6s · 9 cmds |
| Squash commits | 7/7 · 111.6s · 12 cmds | 6/7 · 105.8s · 15 cmds | **7/7 · 82.5s · 10 cmds** |
| All scenarios | 28/35 · 248.4s · 24 cmds | 25/35 · 282.2s · 26 cmds | 35/35 · 97.5s · 10 cmds |
Both agents are run to check whether the tool effect holds across them; this is not a Claude-versus-Codex comparison.
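The wall-time reductions quoted in the summary can be rechecked from the "All scenarios" rows. The following is a minimal sketch of that arithmetic, assuming the cell format used in the tables; `parseCell` and `reduction` are illustrative helpers, not functions from this repo.

```javascript
// Parse a results cell such as "35/35 · 89.4s · 23 cmds".
// parseCell and reduction are illustrative helpers, not part of the repo.
function parseCell(cell) {
  const match = cell.match(/^(\d+)\/(\d+) · ([\d.]+)s · (\d+) cmds$/);
  if (!match) throw new Error(`unrecognized cell: ${cell}`);
  const [, passed, runs, seconds, commands] = match;
  return {
    passed: Number(passed),
    runs: Number(runs),
    seconds: Number(seconds),
    commands: Number(commands),
  };
}

// Percentage reduction in mean wall time versus a baseline, to one decimal place.
function reduction(baselineSeconds, candidateSeconds) {
  const fraction = (baselineSeconds - candidateSeconds) / baselineSeconds;
  return Math.round(fraction * 1000) / 10;
}

// "All scenarios" cells for plain git and GitButler, copied from the tables.
const codexGit = parseCell("35/35 · 89.4s · 23 cmds");
const codexBut = parseCell("35/35 · 29.0s · 4 cmds");
const claudeGit = parseCell("28/35 · 248.4s · 24 cmds");
const claudeBut = parseCell("35/35 · 97.5s · 10 cmds");

console.log(reduction(codexGit.seconds, codexBut.seconds));   // 67.6
console.log(reduction(claudeGit.seconds, claudeBut.seconds)); // 60.7
```

These round to the "roughly 68%" and "61%" figures stated above.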
More detail:
- Interactive results with per-scenario breakdowns and the failure ledger: vcbench.dev (source in web/).
- Checked-in results overview: docs/results/README.md.
- Latest full-matrix writeup: docs/results/full-k7-2026-07-03.md.
Each scenario is a pre-built Git repository (a commit history plus uncommitted changes) and a plain-English instruction describing the intended result. No code is generated during a run; only the version-control operation is measured. For a friendlier walk-through with sketches, see docs/scenarios.md.
1. selective commit: messy worktree -> [one clean validation commit] + leftovers
2. multi-amend: dirty fixes -> old commit A, old commit C, old commit E
3. split commit: [big mixed commit] -> [validation] [scoring] [docs]
4. reorder commits: A B C D E F -> A D E B C F
5. squash commits: A B C D E F G -> A [B+C] D [E+F+G]
| Task | What it tests | QA |
|---|---|---|
| `pilot-1-selective-validation` | Create a new branch and commit only input-validation changes while leaving mixed same-file and cross-file leftovers uncommitted. | `npm run pilot:check` |
| `pilot-2-multi-amend` | Route dirty hunks into three different existing commits while preserving unrelated leftovers. | `npm run pilot2:check` |
| `pilot-3-split-commit` | Replace one broad non-top commit with three semantic commits, keep later history above it, and expose leftovers as uncommitted. | `npm run pilot3:check` |
| `pilot-4-reorder-commits` | Move an adjacent commit block earlier in a six-commit branch without changing commit contents. | `npm run pilot4:check` |
| `pilot-5-squash-commits` | Squash two adjacent commit groups in a seven-commit branch into two semantic commits while preserving final contents. | `npm run pilot5:check` |
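To make the reorder and squash sketches concrete, here is a small model that treats a branch as an ordered list of commit labels, oldest first. It captures only ordering and grouping, not file contents, and `moveBlock` and `squashGroups` are hypothetical helpers written for this illustration.

```javascript
// Model a branch as an ordered list of commit labels (oldest first).
// moveBlock and squashGroups are illustrative helpers, not part of the repo.

// Move `length` adjacent commits starting at index `from` so that they begin
// at index `to` in the list that remains after removing them.
function moveBlock(commits, from, length, to) {
  const rest = commits.slice();
  const block = rest.splice(from, length);
  rest.splice(to, 0, ...block);
  return rest;
}

// Squash adjacent commits: `sizes` gives how many consecutive commits
// are folded into each resulting commit.
function squashGroups(commits, sizes) {
  const total = sizes.reduce((sum, size) => sum + size, 0);
  if (total !== commits.length) {
    throw new Error("group sizes must cover every commit exactly once");
  }
  const result = [];
  let index = 0;
  for (const size of sizes) {
    const group = commits.slice(index, index + size);
    result.push(group.length === 1 ? group[0] : `[${group.join("+")}]`);
    index += size;
  }
  return result;
}

// Reorder: move the adjacent block D E to sit directly after A.
console.log(moveBlock(["A", "B", "C", "D", "E", "F"], 3, 2, 1).join(" "));
// A D E B C F

// Squash: fold B+C and E+F+G, leaving A and D untouched.
console.log(squashGroups(["A", "B", "C", "D", "E", "F", "G"], [1, 2, 1, 3]).join(" "));
// A [B+C] D [E+F+G]
```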
- Identical instruction across tools. Each task ships as one prepared fixture repo with one plain-English instruction. The tool's name does not appear in the prompt; the agent decides how to carry it out.
- Deterministic grader. Correctness is checked by a hidden, scripted verifier that inspects the final Git state: commit boundaries, branch topology, and what stayed uncommitted. It is not an LLM judge and does not compare commands against a reference — two different command sequences pass if they produce the same history.
- Timing boundary. Fixture build, workspace prep, skill installation, and dirty-state application all happen before timing begins; the measured figures cover only the agent's work on the task.
- Git write restriction. In GitButler and Jujutsu runs, raw git write commands are blocked so the agent must use the tool under test. Git calls a tool makes internally count as tool-internal work, not agent commands.
- k=7. Every agent-tool-task cell ran seven times; reported numbers are means over those runs.
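The idea behind state-based grading can be sketched as follows. This is a simplified illustration, not the repo's verifier: the final-state shape (commits as lists of edit-atom ids plus a list of uncommitted atoms) and the atom names are assumptions made for the example.

```javascript
// Hypothetical final-state shape: commits (oldest first) as lists of edit-atom
// ids, plus the atoms left uncommitted. The real verifiers may differ.

// Compare two lists as unordered collections of ids.
function sameSet(a, b) {
  return a.length === b.length &&
    [...a].sort().join("\n") === [...b].sort().join("\n");
}

// Grade only the resulting state; the commands that produced it are ignored.
function grade(expected, actual) {
  if (expected.commits.length !== actual.commits.length) return false;
  const commitsMatch = expected.commits.every(
    (atoms, i) => sameSet(atoms, actual.commits[i])
  );
  return commitsMatch && sameSet(expected.uncommitted, actual.uncommitted);
}

const expected = {
  commits: [["validate-input", "validate-range"]],
  uncommitted: ["logging-tweak"],
};

// Two runs that used different (placeholder) command sequences but reached
// the same state, and one run that also committed the leftover.
const runA = {
  commands: ["<command sequence A>"],
  commits: [["validate-range", "validate-input"]],
  uncommitted: ["logging-tweak"],
};
const runB = {
  commands: ["<command sequence B>", "<another command>"],
  commits: [["validate-input", "validate-range"]],
  uncommitted: ["logging-tweak"],
};
const runC = {
  commands: ["<command sequence C>"],
  commits: [["validate-input", "validate-range", "logging-tweak"]],
  uncommitted: [],
};

console.log(grade(expected, runA)); // true
console.log(grade(expected, runB)); // true
console.log(grade(expected, runC)); // false
```

Because `grade` never reads `commands`, any route to the same final state scores the same.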
Full method docs: benchmark design, scoring and validation, fairness and anti-cheat, results presentation.
- Five VC-only pilot tasks under tasks/, with a quick index at tasks/README.md.
- Synthetic TypeScript fixtures generated by `scripts/create-pilot*-fixture.mjs`.
- Hidden oracle verifiers under `scripts/verify-pilot*.mjs`.
- Reference `git` and `but` solutions in each task directory.
- An agent runner for Codex and Claude: `scripts/run-pilot-agent.mjs`.
- Tool-policy wrappers that block the wrong write tool per arm and split measurements into task, platform, and tool-internal commands.
- Checked-in result summaries under docs/results/.
- The vcbench.dev results site under web/.
Design notes live in docs/README.md; this root README doubles as the operator runbook below.

    npm run pilot:check
    npm run pilot2:check
    npm run pilot3:check
    npm run pilot4:check
    npm run pilot5:check

Each check proves that no-op and known-wrong states fail, then verifies the reference solutions.
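The shape of such a self-check can be sketched like this. It is an assumption-laden illustration rather than the repo's code: `selfCheck` is a hypothetical helper, and `verify` stands in for a task verifier that returns `true` when a final state is acceptable.

```javascript
// selfCheck is an illustrative helper; `verify` stands in for a task verifier.
// It returns a list of problems, which is empty when the verifier behaves.
function selfCheck(verify, { noOp, knownWrong, references }) {
  const problems = [];
  if (verify(noOp)) problems.push("no-op state passed");
  knownWrong.forEach((state, i) => {
    if (verify(state)) problems.push(`known-wrong state ${i} passed`);
  });
  references.forEach((state, i) => {
    if (!verify(state)) problems.push(`reference solution ${i} failed`);
  });
  return problems;
}

// Toy verifier for the reorder scenario: only the target order is accepted.
const verifyReorder = (commits) => commits.join(" ") === "A D E B C F";

const cases = {
  noOp: ["A", "B", "C", "D", "E", "F"],
  knownWrong: [["A", "E", "D", "B", "C", "F"]],
  references: [["A", "D", "E", "B", "C", "F"]],
};

console.log(selfCheck(verifyReorder, cases)); // []

// A verifier that accepts everything is caught by the no-op and known-wrong cases.
console.log(selfCheck(() => true, cases).length); // 2
```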
Run one task with Codex:

    npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm git
    npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm 'but+skill'
    npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm 'jj+skill'

Run one task with Claude:

    npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm git
    npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm 'but+skill'
    npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm 'jj+skill'

Defaults are `--task pilot-1-selective-validation`, `--agent codex`, `--arm git`, and Codex model `gpt-5.5`. Use `--model <name>` to override.
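How the documented defaults combine with explicit flags can be sketched as below. This is not the runner's actual parser: `parseRunnerArgs` is a hypothetical helper, it accepts only `--flag value` pairs, and it leaves the model unset for non-Codex agents because a default is stated here only for Codex.

```javascript
// Illustrative parser for the documented flags; not the runner's actual code.
const DEFAULTS = {
  task: "pilot-1-selective-validation",
  agent: "codex",
  arm: "git",
};
const AGENTS = ["codex", "claude"];
const ARMS = ["git", "but+skill", "jj+skill"];

function parseRunnerArgs(argv) {
  const options = { ...DEFAULTS };
  // Walk the arguments as "--flag value" pairs.
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (!flag.startsWith("--") || value === undefined) {
      throw new Error(`expected "--flag value" at: ${flag}`);
    }
    options[flag.slice(2)] = value;
  }
  if (!AGENTS.includes(options.agent)) throw new Error(`unknown agent: ${options.agent}`);
  if (!ARMS.includes(options.arm)) throw new Error(`unknown arm: ${options.arm}`);
  // Only the Codex default model is documented; other agents stay unset here.
  if (options.model === undefined && options.agent === "codex") {
    options.model = "gpt-5.5";
  }
  return options;
}

console.log(parseRunnerArgs([]));
// { task: 'pilot-1-selective-validation', agent: 'codex', arm: 'git', model: 'gpt-5.5' }

console.log(parseRunnerArgs(["--task", "pilot-3-split-commit", "--arm", "jj+skill"]).arm);
// jj+skill
```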
The supported arms are:
- `git`: plain Git is allowed for version-control writes; `but` and `jj` are blocked.
- `but+skill`: GitButler is prepared before the measured run, the GitButler skill is installed into `.codex/skills/but` and `.claude/skills/but`, local `AGENTS.md`/`CLAUDE.md` files are written, and raw Git write commands are blocked.
- `jj+skill`: the fixture repo is prepared with `jj git init --colocate`, the external `onevcat/skills@onevcat-jj` skill is fetched into the run directory and installed into the agent skill folders, local `AGENTS.md`/`CLAUDE.md` files are written, and raw Git writes plus GitButler are blocked.
Pre-run fixture setup, tool setup, applying task branches, skill installation, and dirty-state application are excluded from measured agent duration and command metrics.
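The per-arm blocking rules can be pictured as a small policy function. This is a sketch under stated assumptions, not the repo's tool-policy wrappers: the set of Git subcommands treated as writes is a guess for illustration, and the `git` arm is simplified to block every `but` and `jj` invocation.

```javascript
// Assumed set of Git subcommands treated as writes; the real wrappers may
// classify commands differently.
const GIT_WRITE_SUBCOMMANDS = new Set([
  "add", "commit", "rebase", "reset", "cherry-pick", "merge",
  "stash", "checkout", "switch", "branch", "restore",
]);

// Decide whether a command (as an argv array) is blocked in a given arm.
// isBlocked is an illustrative helper, not part of the repo.
function isBlocked(arm, argv) {
  const [tool, subcommand] = argv;
  const gitWrite = tool === "git" && GIT_WRITE_SUBCOMMANDS.has(subcommand);
  switch (arm) {
    case "git":
      return tool === "but" || tool === "jj"; // only plain Git may write
    case "but+skill":
      return gitWrite; // raw Git writes are blocked
    case "jj+skill":
      return gitWrite || tool === "but"; // raw Git writes plus GitButler
    default:
      throw new Error(`unknown arm: ${arm}`);
  }
}

console.log(isBlocked("git", ["jj", "log"]));           // true
console.log(isBlocked("but+skill", ["git", "commit"])); // true
console.log(isBlocked("but+skill", ["git", "status"])); // false
console.log(isBlocked("jj+skill", ["but"]));            // true
```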
Build `but` from the local GitButler checkout and pass it to the runner:

    npm run but:build
    npm run pilot:check -- --but-bin /Users/kiril/src/gitbutler/target/release/but
    npm run pilot:agent -- --agent codex --arm 'but+skill' --but-bin /Users/kiril/src/gitbutler/target/release/but

Use `--skill-dir <path>` to test a different GitButler skill directory.
The jj+skill arm uses the jj binary found on PATH by default. Override it with --jj-bin <path>.
By default, the runner fetches the external onevcat/skills@onevcat-jj skill from https://raw.githubusercontent.com/onevcat/skills/master/skills/onevcat-jj/SKILL.md. Use --jj-skill-dir <path> to use a local copy, or --jj-skill-package, --jj-skill-name, and --jj-skill-url to point at another public skill.
Codex trials use clean config by default: isolated per-run CODEX_HOME, auth material only, ignored user rules, ephemeral mode, and plugins disabled. That keeps user config and plugin noise out of timing and transcript measurements.
Useful debug knobs:

    npm run pilot:agent -- --agent codex --arm git --codex-isolated-home false
    npm run pilot:agent -- --agent codex --arm git --codex-disable-plugins false
    npm run pilot:agent -- --agent codex --arm git --codex-clean-config false

Run artifacts are written under `tmp/pilot-runs/` and ignored by Git. A run directory contains the sandbox workspace, `result.json`, the command trace, generated instruction files, and verifier output.
The block to read is `measurement`, not the older, coarser `metrics` block. It separates:
- task-relevant VC commands
- platform probes from Codex or Claude startup
- tool-internal Git calls
- command timing
- cold and warm-estimated transcript bytes
- warning and skill/reference output bytes
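As a rough illustration of working with such a breakdown, the sketch below totals commands and time per category. The object shape and field names (`commands`, `category`, `durationMs`) are hypothetical and chosen for the example; check an actual run's output for the real layout.

```javascript
// Hypothetical shape for illustration; real field names may differ.
const measurement = {
  commands: [
    { category: "platform", durationMs: 40 },
    { category: "task", durationMs: 120 },
    { category: "tool-internal", durationMs: 15 },
    { category: "task", durationMs: 300 },
  ],
};

// Count commands and sum their durations per category.
function summarize(block) {
  const summary = {};
  for (const { category, durationMs } of block.commands) {
    if (!summary[category]) summary[category] = { count: 0, totalMs: 0 };
    summary[category].count += 1;
    summary[category].totalMs += durationMs;
  }
  return summary;
}

console.log(summarize(measurement).task); // { count: 2, totalMs: 420 }
```

Keeping the categories separate is what lets task-relevant commands be reported without platform startup probes or tool-internal Git calls inflating the count.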
- docs/scenarios.md: plain-English scenario guide with sketches.
- docs/benchmark-design.md: benchmark model, task lifecycle, arm setup, reporting.
- docs/task-format.md: task package shape and fixture rules.
- docs/scoring-and-validation.md: Git-state oracles, semantic edit atoms, failure classes, metrics.
- docs/fairness-and-anti-cheat.md: tool-policy boundaries and leakage prevention.
- docs/results-presentation.md: how to report batches without cherry-picking.
- docs/research-notes.md: external benchmark research.