version-control-bench

A version-control benchmark for coding agents. Live results: vcbench.dev

Coding agents do a growing share of version-control work. This benchmark measures how Claude Code and Codex handle five common version-control tasks with three tools: plain git, Jujutsu (jj+skill), and GitButler (but+skill).

This is not a coding benchmark. The file changes already exist before the agent starts; the agent's job is to produce the right Git-visible state — commit boundaries, branch topology, what stays uncommitted, protected history. Each tool is scored on reliability, speed, and efficiency, judged on the resulting Git history rather than the commands used to produce it.

The benchmark is maintained by GitButler, one of the three tools measured; the grader is deterministic and all data and task definitions are public in this repo.

Latest results

Full matrix from 2026-07-03: 5 scenarios x 3 tools x 2 agents, seven runs per cell (k=7), 210 graded runs, 193 passed. GitButler passed all 70 of its runs while cutting mean wall time by roughly 68% (Codex) and 61% (Claude) versus plain git. All 17 grader failures were Claude runs on git or Jujutsu, concentrated on split-commit.
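
The headline figures can be re-derived from the matrix shape and the "All scenarios" rows of the two tables below. A minimal sketch of that arithmetic in JavaScript; the variable and function names are illustrative and not part of the benchmark's scripts:

```javascript
// Matrix shape and "All scenarios" figures copied from this README.
const matrix = { scenarios: 5, tools: 3, agents: 2, runsPerCell: 7 };
const totalRuns =
  matrix.scenarios * matrix.tools * matrix.agents * matrix.runsPerCell;

const passedByCell = {
  codex: { git: 35, jj: 35, but: 35 },
  claude: { git: 28, jj: 25, but: 35 },
};
const meanWallSeconds = {
  codex: { git: 89.4, but: 29.0 },
  claude: { git: 248.4, but: 97.5 },
};

// Sum the pass counts over every agent and tool.
const totalPassed = Object.values(passedByCell)
  .flatMap((tools) => Object.values(tools))
  .reduce((sum, n) => sum + n, 0);

// Percentage by which `candidate` undercuts `baseline`.
function reductionPercent(baseline, candidate) {
  return (1 - candidate / baseline) * 100;
}

console.log(totalRuns, totalPassed, totalRuns - totalPassed); // 210 193 17
for (const [agent, t] of Object.entries(meanWallSeconds)) {
  console.log(agent, reductionPercent(t.git, t.but).toFixed(1) + "%");
}
// codex 67.6%
// claude 60.7%
```

The 17 failures are the 7 Claude runs that failed on git plus the 10 that failed on Jujutsu; 11 of them fall in the split-commit row.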

Each cell shows pass rate, mean wall time, and mean version-control commands per run. Bold marks the fastest tool that passed every run of that scenario.

Codex (gpt-5.5)

Scenario	git	Jujutsu	GitButler
Selective commit	7/7 · 67.8s · 19 cmds	7/7 · 99.2s · 20 cmds	7/7 · 30.8s · 2 cmds
Multi-amend	7/7 · 174.4s · 44 cmds	7/7 · 208.5s · 25 cmds	7/7 · 36.5s · 6 cmds
Split commit	7/7 · 116.2s · 30 cmds	7/7 · 185.9s · 37 cmds	7/7 · 33.0s · 6 cmds
Reorder commits	7/7 · 54.4s · 11 cmds	7/7 · 58.4s · 11 cmds	7/7 · 20.6s · 2 cmds
Squash commits	7/7 · 34.1s · 11 cmds	7/7 · 43.3s · 11 cmds	7/7 · 24.3s · 3 cmds
All scenarios	35/35 · 89.4s · 23 cmds	35/35 · 119.0s · 21 cmds	35/35 · 29.0s · 4 cmds

Claude Code

Scenario	git	Jujutsu	GitButler
Selective commit	6/7 · 169.9s · 18 cmds	5/7 · 172.3s · 21 cmds	7/7 · 52.3s · 4 cmds
Multi-amend	6/7 · 598.1s · 58 cmds	6/7 · 585.5s · 36 cmds	7/7 · 97.5s · 9 cmds
Split commit	2/7 · 294.4s · 25 cmds	1/7 · 456.4s · 43 cmds	7/7 · 157.5s · 17 cmds
Reorder commits	7/7 · 68.0s · 6 cmds	7/7 · 91.3s · 14 cmds	7/7 · 97.6s · 9 cmds
Squash commits	7/7 · 111.6s · 12 cmds	6/7 · 105.8s · 15 cmds	7/7 · 82.5s · 10 cmds
All scenarios	28/35 · 248.4s · 24 cmds	25/35 · 282.2s · 26 cmds	35/35 · 97.5s · 10 cmds
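
The bold-marking rule stated above the tables (fastest tool among those that passed every run) is mechanical. The sketch below applies it to the Claude Code rows; the tool keys and the function name are illustrative, not taken from the repo:

```javascript
// Per-scenario cells copied from the Claude Code table above.
const claudeRows = {
  "Selective commit": {
    git: { passes: 6, runs: 7, seconds: 169.9 },
    jj: { passes: 5, runs: 7, seconds: 172.3 },
    but: { passes: 7, runs: 7, seconds: 52.3 },
  },
  "Multi-amend": {
    git: { passes: 6, runs: 7, seconds: 598.1 },
    jj: { passes: 6, runs: 7, seconds: 585.5 },
    but: { passes: 7, runs: 7, seconds: 97.5 },
  },
  "Split commit": {
    git: { passes: 2, runs: 7, seconds: 294.4 },
    jj: { passes: 1, runs: 7, seconds: 456.4 },
    but: { passes: 7, runs: 7, seconds: 157.5 },
  },
  "Reorder commits": {
    git: { passes: 7, runs: 7, seconds: 68.0 },
    jj: { passes: 7, runs: 7, seconds: 91.3 },
    but: { passes: 7, runs: 7, seconds: 97.6 },
  },
  "Squash commits": {
    git: { passes: 7, runs: 7, seconds: 111.6 },
    jj: { passes: 6, runs: 7, seconds: 105.8 },
    but: { passes: 7, runs: 7, seconds: 82.5 },
  },
};

// Return the fastest tool among those that passed every run, or null if none did.
function fastestFullyPassing(row) {
  const eligible = Object.entries(row).filter(
    ([, cell]) => cell.passes === cell.runs
  );
  if (eligible.length === 0) return null;
  eligible.sort((a, b) => a[1].seconds - b[1].seconds);
  return eligible[0][0];
}

for (const [scenario, row] of Object.entries(claudeRows)) {
  console.log(scenario, "->", fastestFullyPassing(row));
}
// Selective commit -> but
// Multi-amend -> but
// Split commit -> but
// Reorder commits -> git
// Squash commits -> but
```

Note that a faster mean time does not win if any run failed: Jujutsu's 105.8s on squash commits is excluded because it passed 6 of 7 runs.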

Both agents are run to check whether the tool effect holds across them; this is not a Claude-versus-Codex comparison.

More detail:

Interactive results with per-scenario breakdowns and the failure ledger: vcbench.dev (source in web/).
Checked-in results overview: docs/results/README.md.
Latest full-matrix writeup: docs/results/full-k7-2026-07-03.md.

Scenarios

Each scenario is a pre-built Git repository (a commit history plus uncommitted changes) and a plain-English instruction describing the intended result. No code is generated during a run; only the version-control operation is measured. For a friendlier walk-through with sketches, see docs/scenarios.md.

1. selective commit:   messy worktree -> [one clean validation commit] + leftovers
2. multi-amend:        dirty fixes -> old commit A, old commit C, old commit E
3. split commit:       [big mixed commit] -> [validation] [scoring] [docs]
4. reorder commits:    A B C D E F -> A D E B C F
5. squash commits:     A B C D E F G -> A [B+C] D [E+F+G]
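
To make the reorder and squash targets concrete, they can be modeled as plain list operations on commit labels. This is only an illustration of the intended end state, not how any of the tools perform the rewrite; both function names are hypothetical:

```javascript
// Move a contiguous block of commits to a new position.
// `newIndex` is the position in the list after the block has been removed.
function moveBlock(commits, blockStart, blockLength, newIndex) {
  const rest = commits.slice();
  const block = rest.splice(blockStart, blockLength);
  rest.splice(newIndex, 0, ...block);
  return rest;
}

// Collapse consecutive commits into groups of the given sizes.
// Groups larger than one are rendered as "[X+Y]" to match the sketch above.
function squashGroups(commits, groupSizes) {
  const result = [];
  let index = 0;
  for (const size of groupSizes) {
    const group = commits.slice(index, index + size);
    result.push(size === 1 ? group[0] : `[${group.join("+")}]`);
    index += size;
  }
  if (index !== commits.length) {
    throw new Error("group sizes must cover every commit exactly once");
  }
  return result;
}

// Scenario 4: move the block D E so it sits directly after A.
console.log(moveBlock(["A", "B", "C", "D", "E", "F"], 3, 2, 1).join(" "));
// A D E B C F

// Scenario 5: squash B+C and E+F+G.
console.log(squashGroups(["A", "B", "C", "D", "E", "F", "G"], [1, 2, 1, 3]).join(" "));
// A [B+C] D [E+F+G]
```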

Task	What it tests	QA
`pilot-1-selective-validation`	Create a new branch and commit only input-validation changes while leaving mixed same-file and cross-file leftovers uncommitted.	`npm run pilot:check`
`pilot-2-multi-amend`	Route dirty hunks into three different existing commits while preserving unrelated leftovers.	`npm run pilot2:check`
`pilot-3-split-commit`	Replace one broad non-top commit with three semantic commits, keep later history above it, and expose leftovers as uncommitted.	`npm run pilot3:check`
`pilot-4-reorder-commits`	Move an adjacent commit block earlier in a six-commit branch without changing commit contents.	`npm run pilot4:check`
`pilot-5-squash-commits`	Squash two adjacent commit groups in a seven-commit branch into two semantic commits while preserving final contents.	`npm run pilot5:check`

How it's scored

Identical instruction across tools. Each task ships as one prepared fixture repo with one plain-English instruction. The tool's name does not appear in the prompt; the agent decides how to carry it out.
Deterministic grader. Correctness is checked by a hidden, scripted verifier that inspects the final Git state: commit boundaries, branch topology, and what stayed uncommitted. It is not an LLM judge and does not compare commands against a reference — two different command sequences pass if they produce the same history.
Timing boundary. Fixture build, workspace prep, skill installation, and dirty-state application all happen before timing begins; the measured figures cover only the agent's work on the task.
Git write restriction. In GitButler and Jujutsu runs, raw git write commands are blocked so the agent must use the tool under test. Git calls a tool makes internally count as tool-internal work, not agent commands.
k=7. Every agent-tool-task cell ran seven times; reported numbers are means over those runs.
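
The idea of grading state rather than commands can be sketched as follows. This is a simplified, file-level illustration under assumed data shapes; the real verifiers in scripts/ inspect an actual Git repository and work at a finer granularity (the tasks include same-file leftovers). The state shape, the file names, and `gradeFinalState` are all hypothetical:

```javascript
function sortedCopy(list) {
  return [...list].sort();
}

function sameList(a, b) {
  return a.length === b.length && a.every((value, i) => value === b[i]);
}

// Compare a final repository state against the expected one.
// No command history is passed in, so it cannot influence the verdict.
function gradeFinalState(actual, expected) {
  const failures = [];
  if (actual.branch !== expected.branch) {
    failures.push(`branch: expected ${expected.branch}, got ${actual.branch}`);
  }
  // Commit boundaries: same commits, in order, each touching the same files.
  const boundaries = (state) =>
    state.commits.map((commit) => sortedCopy(commit.files).join(","));
  if (!sameList(boundaries(actual), boundaries(expected))) {
    failures.push("commit boundaries differ");
  }
  // Leftovers: the same files must remain uncommitted.
  if (!sameList(sortedCopy(actual.uncommitted), sortedCopy(expected.uncommitted))) {
    failures.push("uncommitted leftovers differ");
  }
  return { pass: failures.length === 0, failures };
}

// Hypothetical expected state for a selective-commit style task.
const expectedState = {
  branch: "validation",
  commits: [{ files: ["src/validate.ts"] }],
  uncommitted: ["src/scoring.ts", "README.md"],
};

// Same end state, leftovers listed in a different order: passes.
const goodState = {
  branch: "validation",
  commits: [{ files: ["src/validate.ts"] }],
  uncommitted: ["README.md", "src/scoring.ts"],
};

// A leftover file was swept into the commit: fails.
const badState = {
  branch: "validation",
  commits: [{ files: ["src/validate.ts", "src/scoring.ts"] }],
  uncommitted: ["README.md"],
};

console.log(gradeFinalState(goodState, expectedState).pass); // true
console.log(gradeFinalState(badState, expectedState).failures);
// [ 'commit boundaries differ', 'uncommitted leftovers differ' ]
```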

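Full method docs: benchmark design (docs/benchmark-design.md), scoring and validation (docs/scoring-and-validation.md), fairness and anti-cheat (docs/fairness-and-anti-cheat.md), and results presentation (docs/results-presentation.md).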

What's here

Five VC-only pilot tasks under tasks/, with a quick index at tasks/README.md.
Synthetic TypeScript fixtures generated by scripts/create-pilot*-fixture.mjs.
Hidden oracle verifiers under scripts/verify-pilot*.mjs.
Reference solutions for git and but (the GitButler CLI) in each task directory.
An agent runner for Codex and Claude: scripts/run-pilot-agent.mjs.
Tool-policy wrappers that block the wrong write tool per arm and split measurements into task, platform, and tool-internal commands.
Checked-in result summaries under docs/results/.
The vcbench.dev results site under web/.

Design notes live in docs/README.md; the rest of this root README doubles as the operator runbook.

Running the benchmark

Verifier QA

npm run pilot:check
npm run pilot2:check
npm run pilot3:check
npm run pilot4:check
npm run pilot5:check

Each check first confirms that the no-op state and known-wrong states fail, then verifies that the reference solutions pass.
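
The shape of such a check can be sketched as below: a verifier is only trusted if it rejects states that should fail and accepts the reference solutions. The toy verifier, the state shape, and `checkVerifier` are assumptions for illustration and do not reflect the actual scripts:

```javascript
// Run a verifier against states with known verdicts and report any surprises.
function checkVerifier(verify, { noOp, knownWrong, references }) {
  const problems = [];
  if (verify(noOp).pass) problems.push("no-op state passed");
  knownWrong.forEach((state, i) => {
    if (verify(state).pass) problems.push(`known-wrong state ${i} passed`);
  });
  references.forEach((state, i) => {
    if (!verify(state).pass) problems.push(`reference solution ${i} failed`);
  });
  return problems;
}

// Toy verifier for a split-commit style task: expects exactly three commits.
const toyVerify = (state) => ({ pass: state.commits.length === 3 });

const qaStates = {
  noOp: { commits: ["big mixed commit"] },
  knownWrong: [{ commits: ["validation", "scoring and docs"] }],
  references: [{ commits: ["validation", "scoring", "docs"] }],
};

console.log(checkVerifier(toyVerify, qaStates)); // []

// A verifier that accepts everything is caught by the negative cases.
const alwaysPass = () => ({ pass: true });
console.log(checkVerifier(alwaysPass, qaStates));
// [ 'no-op state passed', 'known-wrong state 0 passed' ]
```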

Agent trials

Run one task with Codex:

npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm git
npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm 'but+skill'
npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm 'jj+skill'

Run one task with Claude:

npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm git
npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm 'but+skill'
npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm 'jj+skill'

Defaults are --task pilot-1-selective-validation, --agent codex, --arm git, and Codex model gpt-5.5. Use --model <name> to override.
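
How those defaults combine with explicit flags can be sketched like this. It is not the runner's actual argument parser, only an illustration of the documented defaults; `resolveOptions` is a hypothetical name, and because this README states a default model only for Codex, the sketch leaves the model unset for other agents:

```javascript
// Defaults as documented above.
const DEFAULTS = {
  task: "pilot-1-selective-validation",
  agent: "codex",
  arm: "git",
};
const DEFAULT_CODEX_MODEL = "gpt-5.5";

// Parse a flat list of `--flag value` pairs on top of the defaults.
function resolveOptions(argv) {
  const options = { ...DEFAULTS };
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (!flag.startsWith("--") || value === undefined) {
      throw new Error(`expected "--flag value", got "${flag}"`);
    }
    options[flag.slice(2)] = value;
  }
  // Only the Codex default model is documented; --model overrides it.
  if (options.model === undefined && options.agent === "codex") {
    options.model = DEFAULT_CODEX_MODEL;
  }
  return options;
}

console.log(resolveOptions([]));
console.log(resolveOptions(["--task", "pilot-3-split-commit", "--arm", "jj+skill"]));
```

With no flags the result is the documented default run: task pilot-1-selective-validation, agent codex, arm git, model gpt-5.5.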

The supported arms are:

git: plain Git is allowed for version-control writes; but and jj are blocked.
but+skill: GitButler is prepared before the measured run, the GitButler skill is installed into .codex/skills/but and .claude/skills/but, local AGENTS.md / CLAUDE.md files are written, and raw Git write commands are blocked.
jj+skill: the fixture repo is prepared with jj git init --colocate, the external onevcat/skills@onevcat-jj skill is fetched into the run directory and installed into the agent skill folders, local AGENTS.md / CLAUDE.md files are written, and raw Git writes plus GitButler are blocked.
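
The write policy of the three arms can be summarized as a small lookup. This is a sketch of the rules as described above, not the wrapper implementation; the table layout and `isCommandAllowed` are hypothetical, and deciding whether a given git command counts as a write is left to the caller:

```javascript
// Policy per arm, transcribed from the descriptions above.
const ARM_POLICY = {
  git: { rawGitWrites: true, blockedTools: ["but", "jj"] },
  // This README does not say whether jj is blocked in the but+skill arm,
  // so the sketch leaves it unlisted.
  "but+skill": { rawGitWrites: false, blockedTools: [] },
  "jj+skill": { rawGitWrites: false, blockedTools: ["but"] },
};

// Decide whether an agent-issued command is allowed in a given arm.
// `isWrite` says whether the command would modify repository state.
function isCommandAllowed(arm, tool, isWrite) {
  const policy = ARM_POLICY[arm];
  if (!policy) throw new Error(`unknown arm: ${arm}`);
  if (policy.blockedTools.includes(tool)) return false;
  if (tool === "git" && isWrite) return policy.rawGitWrites;
  return true;
}

console.log(isCommandAllowed("git", "git", true)); // true
console.log(isCommandAllowed("git", "jj", true)); // false
console.log(isCommandAllowed("but+skill", "git", true)); // false
console.log(isCommandAllowed("but+skill", "git", false)); // true
console.log(isCommandAllowed("jj+skill", "but", true)); // false
```

Read-only git commands stay available in every arm in this sketch, since only raw Git writes are described as blocked.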

Pre-run fixture setup, tool setup, applying task branches, skill installation, and dirty-state application are excluded from measured agent duration and command metrics.

Local GitButler build

Build but from the local GitButler checkout and pass it to the runner:

npm run but:build
npm run pilot:check -- --but-bin /path/to/gitbutler/target/release/but
npm run pilot:agent -- --agent codex --arm 'but+skill' --but-bin /path/to/gitbutler/target/release/but

Use --skill-dir <path> to test a different GitButler skill directory.

Local Jujutsu setup

The jj+skill arm uses the jj binary found on PATH by default. Override it with --jj-bin <path>.

By default, the runner fetches the external onevcat/skills@onevcat-jj skill from https://raw.githubusercontent.com/onevcat/skills/master/skills/onevcat-jj/SKILL.md. Use --jj-skill-dir <path> to use a local copy, or --jj-skill-package, --jj-skill-name, and --jj-skill-url to point at another public skill.

Codex isolation

Codex trials use clean config by default: isolated per-run CODEX_HOME, auth material only, ignored user rules, ephemeral mode, and plugins disabled. That keeps user config and plugin noise out of timing and transcript measurements.

Useful debug knobs:

npm run pilot:agent -- --agent codex --arm git --codex-isolated-home false
npm run pilot:agent -- --agent codex --arm git --codex-disable-plugins false
npm run pilot:agent -- --agent codex --arm git --codex-clean-config false

Outputs

Run artifacts are written under tmp/pilot-runs/ and ignored by Git. A run directory contains the sandbox workspace, result.json, the command trace, generated instruction files, and verifier output.

Read the measurement block, not the older, coarser metrics block. It separates:

task-relevant VC commands
platform probes from Codex or Claude startup
tool-internal Git calls
command timing
cold and warm-estimated transcript bytes
warning and skill/reference output bytes
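
The first three categories can be pictured as a tally over the command trace. The entry shape below (`command`, `category`, `durationMs`) and the category labels are hypothetical; the real field names in result.json are not documented in this README, so treat this only as a sketch of the separation:

```javascript
const CATEGORIES = ["task", "platform", "toolInternal"];

// Count commands and sum their durations per category.
function summarizeTrace(trace) {
  const counts = { task: 0, platform: 0, toolInternal: 0 };
  const durationMs = { task: 0, platform: 0, toolInternal: 0 };
  for (const entry of trace) {
    if (!CATEGORIES.includes(entry.category)) {
      throw new Error(`unknown category: ${entry.category}`);
    }
    counts[entry.category] += 1;
    durationMs[entry.category] += entry.durationMs;
  }
  return { counts, durationMs };
}

// Illustrative trace; the categorization of each command is assumed.
const exampleTrace = [
  { command: "git --version", category: "platform", durationMs: 12 },
  { command: "git status", category: "task", durationMs: 40 },
  { command: "git commit -m 'Add input validation'", category: "task", durationMs: 85 },
  { command: "git rev-parse HEAD", category: "toolInternal", durationMs: 9 },
];

console.log(summarizeTrace(exampleTrace));
// {
//   counts: { task: 2, platform: 1, toolInternal: 1 },
//   durationMs: { task: 125, platform: 12, toolInternal: 9 }
// }
```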

Docs

docs/scenarios.md: plain-English scenario guide with sketches.
docs/benchmark-design.md: benchmark model, task lifecycle, arm setup, reporting.
docs/task-format.md: task package shape and fixture rules.
docs/scoring-and-validation.md: Git-state oracles, semantic edit atoms, failure classes, metrics.
docs/fairness-and-anti-cheat.md: tool-policy boundaries and leakage prevention.
docs/results-presentation.md: how to report batches without cherry-picking.
docs/research-notes.md: external benchmark research.
