
feat(evals): add First-Try Rate reliability column#2308

Open
benjamincanac wants to merge 2 commits into main from feat/evals-first-try-rate

Conversation

@benjamincanac

Member

Overview

Adds a First-Try Rate column to the /evals leaderboard and refreshes public/agent-results.json with the new reliability metrics exported by nuxt-evals.

Several models tie at 100% Success (an eval counts as passed if any of up to 4 attempts passes), so this new metric breaks the tie by measuring first-attempt reliability.

Changes

  • New "First-Try Rate" column to the right of Success Rate, rendering passAt1 as a percentage (muted, right-aligned) with an info tooltip. Shows "—" when the field is absent.
  • Sorting: success rate → First-Try Rate (tiebreak) → recency. Missing passAt1 sorts below models that have it, within the same success rate.
  • Expanded rows: when an eval needed retries (totalRuns > 1), a passedRuns/totalRuns badge (e.g. 1/3) is shown with a "X of Y attempts passed" tooltip.
  • Methodology note under the table explaining how Success Rate, First-Try Rate, and Avg Duration are computed.
  • Data refresh: new agent-results.json with the additive reliability fields (passAt1, firstRunSuccess, passedRuns, totalRuns, passRate).
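
The sort order described in the list above could be sketched as a comparator like the one below. This is only an illustration, not the code in app/pages/evals.vue: the `LeaderboardRow` shape, the `timestamp` recency key, and the `compareRows` name are assumptions; only `passAt1` and the ordering rules come from this description.

```typescript
// Hypothetical row shape: only `passAt1` is a field named in this PR.
interface LeaderboardRow {
  model: string
  successRate: number // fraction in [0, 1]
  passAt1?: number // fraction in [0, 1]; absent in older data files
  timestamp: number // assumed recency key (ms since epoch)
}

// Sort by success rate, then First-Try Rate, then recency.
function compareRows(a: LeaderboardRow, b: LeaderboardRow): number {
  // 1. Higher success rate first.
  if (a.successRate !== b.successRate) return b.successRate - a.successRate

  // 2. Within the same success rate, rows without passAt1 sort below rows with it.
  const aHas = a.passAt1 !== undefined
  const bHas = b.passAt1 !== undefined
  if (aHas !== bHas) return aHas ? -1 : 1

  // 3. Both present: higher first-try rate first.
  if (aHas && bHas && a.passAt1 !== b.passAt1) return b.passAt1! - a.passAt1!

  // 4. Final tiebreak (and the only one for old data): most recent first.
  return b.timestamp - a.timestamp
}
```

With older data, where no row carries `passAt1`, steps 2 and 3 never fire, so the order reduces to success rate followed by recency.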

All new fields are optional: the page still renders against an older agent-results.json (the First-Try Rate column shows "—", and sorting falls back to recency).
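
As a rough sketch of how the optional fields might be rendered, the helpers below format the First-Try Rate cell and the retry badge. The function names are hypothetical, and rounding to a whole percent is an assumption based on the 97% and 90% figures quoted in this description; the actual template in evals.vue may differ.

```typescript
// Format passAt1 (a fraction in [0, 1]) as a percentage, or "—" when absent.
// Whole-percent rounding is an assumption, not confirmed by the PR.
function formatFirstTryRate(passAt1?: number): string {
  return passAt1 === undefined ? '—' : `${Math.round(passAt1 * 100)}%`
}

// Badge text for an expanded row, shown only when the eval needed retries.
function attemptsBadge(passedRuns?: number, totalRuns?: number): string | null {
  if (passedRuns === undefined || totalRuns === undefined || totalRuns <= 1) {
    return null // older data, or a single attempt: no badge
  }
  return `${passedRuns}/${totalRuns}`
}

// Tooltip text matching the "X of Y attempts passed" wording.
function attemptsTooltip(passedRuns: number, totalRuns: number): string {
  return `${passedRuns} of ${totalRuns} attempts passed`
}
```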

Notes

avgPassRate is intentionally not displayed: under the harness's early-exit retry loop it's a censored/biased estimator, so only passAt1 (an uncensored first-run proportion) is surfaced. A follow-up on the nuxt-evals side to drop early-exit would make the retry stats fully unbiased; once that lands, the "failed twice before passing" wording in the methodology note should be updated to the regime-agnostic phrasing already used by the badge tooltip.
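
To illustrate why an early-exit loop biases the averaged pass rate, the sketch below computes the expected per-eval passedRuns/totalRuns under a simplified model. The assumptions are mine, not the harness's documented behavior: attempts are independent with a constant pass probability `p`, retries stop at the first pass, at most 4 attempts are made, and passRate equals passedRuns/totalRuns.

```typescript
// Expected value of passedRuns/totalRuns for one eval when retries stop at the
// first passing attempt (simplified model; see assumptions in the lead-in).
function expectedEarlyExitPassRate(p: number, maxAttempts = 4): number {
  let expected = 0
  for (let k = 1; k <= maxAttempts; k++) {
    // Probability that the first pass happens exactly on attempt k.
    const firstPassAtK = Math.pow(1 - p, k - 1) * p
    // Early exit leaves exactly one pass out of k runs, so the ratio is 1/k.
    expected += firstPassAtK / k
  }
  // If every attempt fails, the ratio is 0/maxAttempts and contributes nothing.
  return expected
}
```

Under this model a true per-attempt pass probability of 0.5 yields an expected ratio of about 0.68, so the average overstates reliability, whereas the first-run proportion has expectation `p` because the first attempt always runs.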

Verify

  • 100%-success models now ordered by First-Try Rate (Fable 5 & Cursor Composer 2.0 at 97% above Opus 4.8 / Gemini 3 / GPT 5.3 at 90%)
  • Renders against the old data file with "—" in the new column and unchanged sorting
  • Extra column follows the table's existing horizontal-scroll pattern on mobile
  • Lint clean

Refresh agent-results.json with the new reliability metrics and surface
first-attempt reliability on the leaderboard:

- New "First-Try Rate" column (passAt1), used as the tiebreak between
  models with the same success rate
- Expanded rows show a passedRuns/totalRuns badge with a tooltip when an
  eval needed retries
- Methodology note under the table explaining how the metrics are computed

All new JSON fields are optional, so the page still renders against an
older agent-results.json.
@benjamincanac benjamincanac requested a review from atinux as a code owner July 3, 2026 16:23
@vercel

vercel Bot commented Jul 3, 2026

Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
nuxt | Ready | Preview, Comment | Jul 3, 2026 5:05pm

@github-actions

github-actions Bot commented Jul 3, 2026


We've flagged this as a potential contribution without a human behind it. We welcome the thoughtful use of AI tools when contributing, but ask all contributors to follow two core principles:

  1. Never let an LLM speak for you - all comments, issues, and PR descriptions should be written in your own words, reflecting your own understanding.
  2. Never let an LLM think for you - only submit contributions you fully understand and can explain.

Please review these AI-assisted contribution guidelines and update this contribution if needed.

If this was flagged in error, we apologise! 😳 Just let us know. 🙏

@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 698674dd-4fa0-43d8-aa9d-cd6d983e6818

📥 Commits

Reviewing files that changed from the base of the PR and between 0c0bf82 and d2206e9.

📒 Files selected for processing (1)
  • app/pages/evals.vue
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/pages/evals.vue

📝 Walkthrough

This change adds first-try and attempt metrics to the evaluations results page. It extends the eval, experiment, and model row data shapes with optional first-run and aggregate fields, derives and sorts by passAt1, adds a First-Try Rate column, and updates the expanded Result cell to show passedRuns/totalRuns when multiple runs exist. It also adds a minimax icon mapping and explanatory copy for the displayed metrics.

Estimated code review effort: 2 (Simple) | ~12 minutes

Sequence Diagram(s)

Not applicable — no direct request/response or async interaction flow was introduced.

Related Issues: None provided.

Related PRs: None provided.

Suggested labels: enhancement, frontend

Suggested reviewers: None provided.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title clearly matches the main change: adding a First-Try Rate column to evals.
Description check | ✅ Passed | The description accurately describes the evals leaderboard, sorting, tooltips, and data refresh changes.
Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/pages/evals.vue`:
- Around line 83-90: The sort tiebreak in sortRows is using passAt1 from the
original experiment row, which can be stale after category filtering. Update the
category-filtered row construction in evals.vue so passAt1 is recomputed from
the filtered subset whenever successRate is recalculated, and ensure sortRows
continues to use the row’s updated passAt1 value for tie-breaking.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c204fd6c-ba32-4f6a-a1a1-e537eed47388

📥 Commits

Reviewing files that changed from the base of the PR and between 1a39262 and 0c0bf82.

📒 Files selected for processing (2)
  • app/pages/evals.vue
  • public/agent-results.json

Comment thread app/pages/evals.vue
When a category filter is applied, passAt1 was left at the full eval-set
value while successRate was recomputed from the subset, making both the
displayed column and the sort tiebreak stale. Recompute passAt1 from the
filtered evals via firstRunSuccess; keep undefined for older data without
the field so it still renders as "—".
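
A minimal sketch of the recomputation this commit describes might look like the following. The `EvalResult` shape and the `recomputePassAt1` name are hypothetical, and returning undefined when any eval in the subset lacks `firstRunSuccess` is an assumption about how mixed data is handled; the commit only states that older data without the field stays undefined.

```typescript
// Hypothetical per-eval shape: only `firstRunSuccess` is named in the PR.
interface EvalResult {
  category: string // assumed filter key
  firstRunSuccess?: boolean // absent in older data files
}

// Recompute the first-try rate over a (possibly category-filtered) subset.
function recomputePassAt1(evals: EvalResult[]): number | undefined {
  if (evals.length === 0) return undefined
  // Older data without the field: keep undefined so the cell renders as "—".
  if (evals.some(e => e.firstRunSuccess === undefined)) return undefined
  const firstTryPasses = evals.filter(e => e.firstRunSuccess === true).length
  return firstTryPasses / evals.length
}

// Usage: filter first, then derive the metric from the same subset that
// the success rate is recomputed from.
function passAt1ForCategory(evals: EvalResult[], category: string): number | undefined {
  return recomputePassAt1(evals.filter(e => e.category === category))
}
```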