gitbutlerapp/version-control-bench

version-control-bench

A version-control benchmark for coding agents. Live results: vcbench.dev

Coding agents do a growing share of version-control work. This benchmark measures how Claude Code and Codex handle five common version-control tasks with three tools: plain git, Jujutsu (jj+skill), and GitButler (but+skill).

This is not a coding benchmark. The file changes already exist before the agent starts; the agent's job is to produce the right Git-visible state — commit boundaries, branch topology, what stays uncommitted, protected history. Each tool is scored on reliability, speed, and efficiency, judged on the resulting Git history rather than the commands used to produce it.

The benchmark is maintained by GitButler, one of the three tools measured; the grader is deterministic and all data and task definitions are public in this repo.

Latest results

Full matrix from 2026-07-03: 5 scenarios × 3 tools × 2 agents at seven runs per cell (k=7) gives 210 graded runs, of which 193 passed. GitButler passed all 70 of its runs while cutting mean wall time by roughly 68% (Codex) and 61% (Claude) versus plain git. All 17 grader failures were Claude runs on git or Jujutsu, and 11 of them were on split-commit.

Each cell shows pass rate, mean wall time, and mean version-control commands per run. Bold marks the fastest tool that passed every run of that scenario.

Codex (gpt-5.5)

| Scenario | git | Jujutsu | GitButler |
| --- | --- | --- | --- |
| Selective commit | 7/7 · 67.8s · 19 cmds | 7/7 · 99.2s · 20 cmds | **7/7 · 30.8s · 2 cmds** |
| Multi-amend | 7/7 · 174.4s · 44 cmds | 7/7 · 208.5s · 25 cmds | **7/7 · 36.5s · 6 cmds** |
| Split commit | 7/7 · 116.2s · 30 cmds | 7/7 · 185.9s · 37 cmds | **7/7 · 33.0s · 6 cmds** |
| Reorder commits | 7/7 · 54.4s · 11 cmds | 7/7 · 58.4s · 11 cmds | **7/7 · 20.6s · 2 cmds** |
| Squash commits | 7/7 · 34.1s · 11 cmds | 7/7 · 43.3s · 11 cmds | **7/7 · 24.3s · 3 cmds** |
| All scenarios | 35/35 · 89.4s · 23 cmds | 35/35 · 119.0s · 21 cmds | **35/35 · 29.0s · 4 cmds** |

Claude Code

| Scenario | git | Jujutsu | GitButler |
| --- | --- | --- | --- |
| Selective commit | 6/7 · 169.9s · 18 cmds | 5/7 · 172.3s · 21 cmds | **7/7 · 52.3s · 4 cmds** |
| Multi-amend | 6/7 · 598.1s · 58 cmds | 6/7 · 585.5s · 36 cmds | **7/7 · 97.5s · 9 cmds** |
| Split commit | 2/7 · 294.4s · 25 cmds | 1/7 · 456.4s · 43 cmds | **7/7 · 157.5s · 17 cmds** |
| Reorder commits | **7/7 · 68.0s · 6 cmds** | 7/7 · 91.3s · 14 cmds | 7/7 · 97.6s · 9 cmds |
| Squash commits | 7/7 · 111.6s · 12 cmds | 6/7 · 105.8s · 15 cmds | **7/7 · 82.5s · 10 cmds** |
| All scenarios | 28/35 · 248.4s · 24 cmds | 25/35 · 282.2s · 26 cmds | **35/35 · 97.5s · 10 cmds** |

Both agents are run to check whether the tool effect holds across them; this is not a Claude-versus-Codex comparison.
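The headline figures can be rechecked from the "All scenarios" rows of the two tables. The sketch below is only a worked calculation: the object layout and the function names are illustrative and are not part of the benchmark's code.

```javascript
// "All scenarios" rows copied from the Codex and Claude Code tables.
const allScenarios = {
  codex: {
    git: { passed: 35, runs: 35, meanSeconds: 89.4 },
    jujutsu: { passed: 35, runs: 35, meanSeconds: 119.0 },
    gitbutler: { passed: 35, runs: 35, meanSeconds: 29.0 },
  },
  claude: {
    git: { passed: 28, runs: 35, meanSeconds: 248.4 },
    jujutsu: { passed: 25, runs: 35, meanSeconds: 282.2 },
    gitbutler: { passed: 35, runs: 35, meanSeconds: 97.5 },
  },
};

// Percentage reduction in mean wall time relative to a baseline, rounded.
function reductionPercent(baselineSeconds, toolSeconds) {
  return Math.round((1 - toolSeconds / baselineSeconds) * 100);
}

// Total passed, graded, and failed runs across every agent and tool.
function totals(matrix) {
  let passed = 0;
  let runs = 0;
  for (const tools of Object.values(matrix)) {
    for (const cell of Object.values(tools)) {
      passed += cell.passed;
      runs += cell.runs;
    }
  }
  return { passed, runs, failed: runs - passed };
}

console.log(totals(allScenarios)); // { passed: 193, runs: 210, failed: 17 }
console.log(reductionPercent(89.4, 29.0)); // 68 (Codex, GitButler vs git)
console.log(reductionPercent(248.4, 97.5)); // 61 (Claude, GitButler vs git)
```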

More detail:

Scenarios

Each scenario is a pre-built Git repository (a commit history plus uncommitted changes) and a plain-English instruction describing the intended result. No code is generated during a run; only the version-control operation is measured. For a friendlier walk-through with sketches, see docs/scenarios.md.

1. selective commit:   messy worktree -> [one clean validation commit] + leftovers
2. multi-amend:        dirty fixes -> old commit A, old commit C, old commit E
3. split commit:       [big mixed commit] -> [validation] [scoring] [docs]
4. reorder commits:    A B C D E F -> A D E B C F
5. squash commits:     A B C D E F G -> A [B+C] D [E+F+G]

| Task | What it tests | QA |
| --- | --- | --- |
| pilot-1-selective-validation | Create a new branch and commit only input-validation changes while leaving mixed same-file and cross-file leftovers uncommitted. | npm run pilot:check |
| pilot-2-multi-amend | Route dirty hunks into three different existing commits while preserving unrelated leftovers. | npm run pilot2:check |
| pilot-3-split-commit | Replace one broad non-top commit with three semantic commits, keep later history above it, and expose leftovers as uncommitted. | npm run pilot3:check |
| pilot-4-reorder-commits | Move an adjacent commit block earlier in a six-commit branch without changing commit contents. | npm run pilot4:check |
| pilot-5-squash-commits | Squash two adjacent commit groups in a seven-commit branch into two semantic commits while preserving final contents. | npm run pilot5:check |
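To make the reorder and squash targets concrete, here is a toy model that treats a branch as an array of commit labels. It is a sketch of the intended end states only; `moveBlock` and `squashGroups` are hypothetical helpers, and the real tasks operate on actual Git history, not label arrays.

```javascript
// Move the contiguous block commits[start..end) so that it begins at index
// `to`, where `to` is measured in the list after the block has been removed.
function moveBlock(commits, start, end, to) {
  const block = commits.slice(start, end);
  const rest = [...commits.slice(0, start), ...commits.slice(end)];
  return [...rest.slice(0, to), ...block, ...rest.slice(to)];
}

// Squash each listed group into one combined entry. Assumes every group is a
// run of adjacent commits given in branch order.
function squashGroups(commits, groups) {
  const result = [];
  let i = 0;
  while (i < commits.length) {
    const group = groups.find((g) => g[0] === commits[i]);
    if (group) {
      result.push(group.join("+"));
      i += group.length;
    } else {
      result.push(commits[i]);
      i += 1;
    }
  }
  return result;
}

// Scenario 4: move the block D E ahead of B C.
console.log(moveBlock(["A", "B", "C", "D", "E", "F"], 3, 5, 1).join(" "));
// A D E B C F

// Scenario 5: squash B..C and E..G.
console.log(
  squashGroups(["A", "B", "C", "D", "E", "F", "G"], [["B", "C"], ["E", "F", "G"]]).join(" ")
);
// A B+C D E+F+G
```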

How it's scored

  • Identical instruction across tools. Each task ships as one prepared fixture repo with one plain-English instruction. The tool's name does not appear in the prompt; the agent decides how to carry it out.
  • Deterministic grader. Correctness is checked by a hidden, scripted verifier that inspects the final Git state: commit boundaries, branch topology, and what stayed uncommitted. It is not an LLM judge and does not compare commands against a reference — two different command sequences pass if they produce the same history.
  • Timing boundary. Fixture build, workspace prep, skill installation, and dirty-state application all happen before timing begins; the measured figures cover only the agent's work on the task.
  • Git write restriction. In GitButler and Jujutsu runs, raw git write commands are blocked so the agent must use the tool under test. Git calls a tool makes internally count as tool-internal work, not agent commands.
  • k=7. Every agent-tool-task cell ran seven times; reported numbers are means over those runs.
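The state-based grading idea can be illustrated with a minimal sketch. It assumes a simplified, hypothetical snapshot shape (a branch name, an ordered list of commits as path lists, and a list of uncommitted paths) and invented file names; the real verifiers under scripts/ inspect an actual repository and check more than this.

```javascript
// Canonical form of a snapshot: ordering inside a path list does not matter,
// but commit order and commit boundaries do.
function canonical(snapshot) {
  return JSON.stringify({
    branch: snapshot.branch,
    commits: snapshot.commits.map((paths) => [...paths].sort()),
    uncommitted: [...snapshot.uncommitted].sort(),
  });
}

// A run passes when its final state matches the expected state, regardless
// of which commands produced it.
function grade(expected, actual) {
  return canonical(expected) === canonical(actual);
}

// Hypothetical expected state for a selective-commit style task.
const expected = {
  branch: "validation",
  commits: [["src/validate.ts"]],
  uncommitted: ["README.md", "src/score.ts"],
};

// Same final state reached by a different route: passes.
const otherRoute = {
  branch: "validation",
  commits: [["src/validate.ts"]],
  uncommitted: ["src/score.ts", "README.md"],
};

// The commit swallowed a leftover file: fails.
const tooBroad = {
  branch: "validation",
  commits: [["src/validate.ts", "src/score.ts"]],
  uncommitted: ["README.md"],
};

console.log(grade(expected, otherRoute)); // true
console.log(grade(expected, tooBroad)); // false
```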

Full method docs: benchmark design, scoring and validation, fairness and anti-cheat, results presentation.

What's here

  • Five VC-only pilot tasks under tasks/, with a quick index at tasks/README.md.
  • Synthetic TypeScript fixtures generated by scripts/create-pilot*-fixture.mjs.
  • Hidden oracle verifiers under scripts/verify-pilot*.mjs.
  • Reference git and but solutions in each task directory.
  • An agent runner for Codex and Claude: scripts/run-pilot-agent.mjs.
  • Tool-policy wrappers that block the wrong write tool per arm and split measurements into task, platform, and tool-internal commands.
  • Checked-in result summaries under docs/results/.
  • The vcbench.dev results site under web/.

Design notes live in docs/README.md; the rest of this root README doubles as the operator runbook.

Running the benchmark

Verifier QA

npm run pilot:check
npm run pilot2:check
npm run pilot3:check
npm run pilot4:check
npm run pilot5:check

Each check first confirms that no-op and known-wrong states fail the verifier, then confirms that the reference solutions pass.
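The shape of such a self-check can be sketched as follows. This is an assumption-laden illustration, not the code behind the npm scripts: `checkVerifier`, the state objects, and the toy verifier are all hypothetical.

```javascript
// Returns a list of problems; an empty list means the verifier rejected the
// no-op and known-wrong states and accepted the reference solution.
function checkVerifier(verify, states) {
  const problems = [];
  if (verify(states.noop)) {
    problems.push("no-op state passed");
  }
  states.knownWrong.forEach((state, index) => {
    if (verify(state)) problems.push(`known-wrong state ${index} passed`);
  });
  if (!verify(states.reference)) {
    problems.push("reference solution failed");
  }
  return problems;
}

// Toy states: each one is reduced to a commit count for illustration.
const states = {
  noop: { commitCount: 1 },
  knownWrong: [{ commitCount: 2 }, { commitCount: 4 }],
  reference: { commitCount: 3 },
};

// A strict toy verifier yields no problems.
console.log(checkVerifier((s) => s.commitCount === 3, states)); // []

// A verifier that accepts everything is caught three times.
console.log(checkVerifier(() => true, states).length); // 3
```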

Agent trials

Run one task with Codex:

npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm git
npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm 'but+skill'
npm run pilot:agent -- --task pilot-3-split-commit --agent codex --arm 'jj+skill'

Run one task with Claude:

npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm git
npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm 'but+skill'
npm run pilot:agent -- --task pilot-5-squash-commits --agent claude --arm 'jj+skill'

Defaults are --task pilot-1-selective-validation, --agent codex, --arm git, and Codex model gpt-5.5. Use --model <name> to override.
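As an illustration of how these defaults combine with explicit flags, here is a minimal sketch of a flag parser. It is not the runner's actual parsing code: `parseRunnerArgs` is a hypothetical name, only `--name value` pairs are handled, and no default model is assumed for Claude because none is documented here.

```javascript
// Documented defaults for the runner.
const DEFAULTS = {
  task: "pilot-1-selective-validation",
  agent: "codex",
  arm: "git",
};
const DEFAULT_CODEX_MODEL = "gpt-5.5";

// Parse ["--flag", "value", ...] pairs on top of the defaults.
function parseRunnerArgs(argv) {
  const options = { ...DEFAULTS };
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (!flag.startsWith("--") || value === undefined) {
      throw new Error(`Expected "--name value", got: ${flag}`);
    }
    options[flag.slice(2)] = value;
  }
  // Only Codex has a documented default model; --model overrides it.
  if (options.model === undefined && options.agent === "codex") {
    options.model = DEFAULT_CODEX_MODEL;
  }
  return options;
}

// No flags: every documented default applies.
console.log(parseRunnerArgs([]));

// Explicit flags replace only the defaults they name.
console.log(parseRunnerArgs(["--task", "pilot-3-split-commit", "--arm", "but+skill"]));
```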

The supported arms are:

  • git: plain Git is allowed for version-control writes; but and jj are blocked.
  • but+skill: GitButler is prepared before the measured run, the GitButler skill is installed into .codex/skills/but and .claude/skills/but, local AGENTS.md / CLAUDE.md files are written, and raw Git write commands are blocked.
  • jj+skill: the fixture repo is prepared with jj git init --colocate, the external onevcat/skills@onevcat-jj skill is fetched into the run directory and installed into the agent skill folders, local AGENTS.md / CLAUDE.md files are written, and raw Git writes plus GitButler are blocked.
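The blocking rules in this list can be summarized as a small decision function. The sketch below is hypothetical: `isBlocked` is not the repo's wrapper, and the set of Git write subcommands is an illustrative, incomplete assumption, since the actual list used by the tool-policy wrappers is not given here.

```javascript
// ASSUMPTION: an illustrative subset of git subcommands that write history
// or the index. The benchmark's real list may differ.
const GIT_WRITE_SUBCOMMANDS = new Set([
  "add", "commit", "rebase", "reset", "merge", "cherry-pick", "revert", "stash",
]);

// Decide whether a command (as an argv array) is blocked in a given arm.
function isBlocked(arm, argv) {
  const [tool, subcommand] = argv;
  const isGitWrite = tool === "git" && GIT_WRITE_SUBCOMMANDS.has(subcommand);
  switch (arm) {
    case "git":
      // Plain Git is allowed; but and jj are blocked.
      return tool === "but" || tool === "jj";
    case "but+skill":
      // Raw Git writes are blocked; Git reads stay available.
      return isGitWrite;
    case "jj+skill":
      // Raw Git writes and GitButler are blocked.
      return isGitWrite || tool === "but";
    default:
      throw new Error(`Unknown arm: ${arm}`);
  }
}

console.log(isBlocked("git", ["jj", "log"])); // true
console.log(isBlocked("but+skill", ["git", "commit", "-m", "x"])); // true
console.log(isBlocked("but+skill", ["git", "log"])); // false
console.log(isBlocked("jj+skill", ["but", "status"])); // true
```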

Pre-run fixture setup, tool setup, applying task branches, skill installation, and dirty-state application are excluded from measured agent duration and command metrics.
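The timing boundary amounts to starting the clock only after all preparation has finished. A minimal sketch, assuming synchronous steps and an injectable clock so the behavior can be checked deterministically; `runTimed` and its parameters are hypothetical, and the real runner drives an external agent process.

```javascript
// Run setup outside the measured window, then time only the agent's work.
function runTimed(setup, agentWork, now = () => Date.now()) {
  const workspace = setup(); // fixture build, tool prep, skill install: excluded
  const start = now();
  const outcome = agentWork(workspace); // only this interval is measured
  const durationMs = now() - start;
  return { outcome, durationMs };
}

// Fake clock: setup "takes" 500 ms and the agent "takes" 120 ms.
let fakeTime = 0;
const result = runTimed(
  () => { fakeTime += 500; return "prepared-workspace"; },
  (workspace) => { fakeTime += 120; return `done in ${workspace}`; },
  () => fakeTime
);

console.log(result.durationMs); // 120
console.log(result.outcome); // done in prepared-workspace
```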

Local GitButler build

Build but from the local GitButler checkout and pass it to the runner:

npm run but:build
npm run pilot:check -- --but-bin /Users/kiril/src/gitbutler/target/release/but
npm run pilot:agent -- --agent codex --arm 'but+skill' --but-bin /Users/kiril/src/gitbutler/target/release/but

Use --skill-dir <path> to test a different GitButler skill directory.

Local Jujutsu setup

The jj+skill arm uses the jj binary found on PATH by default. Override it with --jj-bin <path>.

By default, the runner fetches the external onevcat/skills@onevcat-jj skill from https://raw.githubusercontent.com/onevcat/skills/master/skills/onevcat-jj/SKILL.md. Use --jj-skill-dir <path> to use a local copy, or --jj-skill-package, --jj-skill-name, and --jj-skill-url to point at another public skill.

Codex isolation

Codex trials use clean config by default: isolated per-run CODEX_HOME, auth material only, ignored user rules, ephemeral mode, and plugins disabled. That keeps user config and plugin noise out of timing and transcript measurements.

Useful debug knobs:

npm run pilot:agent -- --agent codex --arm git --codex-isolated-home false
npm run pilot:agent -- --agent codex --arm git --codex-disable-plugins false
npm run pilot:agent -- --agent codex --arm git --codex-clean-config false

Outputs

Run artifacts are written under tmp/pilot-runs/ and ignored by Git. A run directory contains the sandbox workspace, result.json, the command trace, generated instruction files, and verifier output.

For analysis, read the measurement block rather than the older, coarser metrics block. It separates:

  • task-relevant VC commands
  • platform probes from Codex or Claude startup
  • tool-internal Git calls
  • command timing
  • cold and warm-estimated transcript bytes
  • warning and skill/reference output bytes
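To show what separating these categories buys, here is a sketch that groups a command trace by category. The trace shape and the field names (`category`, `durationMs`) and category labels are hypothetical, chosen only for illustration; the actual schema of the measurement block is not described in this README.

```javascript
// Group trace entries by category, counting commands and summing time.
function summarizeTrace(trace) {
  const summary = {};
  for (const { category, durationMs } of trace) {
    if (summary[category] === undefined) {
      summary[category] = { commands: 0, durationMs: 0 };
    }
    summary[category].commands += 1;
    summary[category].durationMs += durationMs;
  }
  return summary;
}

// Hypothetical trace: two task commands, one startup probe, one internal call.
const trace = [
  { category: "task", durationMs: 40 },
  { category: "platform", durationMs: 15 },
  { category: "task", durationMs: 60 },
  { category: "tool-internal", durationMs: 5 },
];

const summary = summarizeTrace(trace);
console.log(summary.task); // { commands: 2, durationMs: 100 }
console.log(summary.platform); // { commands: 1, durationMs: 15 }
```

Keeping platform probes and tool-internal calls out of the task bucket is what lets a per-run command count reflect only the version-control work the agent chose to do.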

Docs
