feat(evals): add First-Try Rate reliability column by benjamincanac · Pull Request #2308 · nuxt/nuxt.com

benjamincanac · 2026-07-03T16:23:26Z

Overview

Adds a First-Try Rate column to the /evals leaderboard and refreshes public/agent-results.json with the new reliability metrics exported by nuxt-evals.

Several models tie at 100% Success (an eval counts as passed if any of up to 4 attempts passes), so this new metric breaks the tie by measuring first-attempt reliability.

Changes

- New "First-Try Rate" column to the right of Success Rate, rendering passAt1 as a percentage (muted, right-aligned) with an info tooltip. Shows — when the field is absent.
- Sorting: success rate → First-Try Rate (tiebreak) → recency. Within the same success rate, models missing passAt1 sort below models that have it.
- Expanded rows: when an eval needed retries (totalRuns > 1), a passedRuns/totalRuns badge (e.g. 1/3) is shown with an "X of Y attempts passed" tooltip.
- Methodology note under the table explaining how Success Rate, First-Try Rate, and Avg Duration are computed.
- Data refresh: new agent-results.json with the additive reliability fields (passAt1, firstRunSuccess, passedRuns, totalRuns, passRate).
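The expanded-row badge described in the list above could be derived roughly as follows. This is a minimal sketch, not the code in evals.vue: `passedRuns` and `totalRuns` are the field names from this PR, while the `EvalAttempts` type and the `attemptsBadge` helper are hypothetical names used for illustration.

```typescript
// Hypothetical shape; only passedRuns and totalRuns come from the PR description.
interface EvalAttempts {
  passedRuns?: number
  totalRuns?: number
}

interface Badge {
  label: string
  tooltip: string
}

// Returns a badge only when the eval needed retries (totalRuns > 1).
// Older data files without the fields produce no badge.
function attemptsBadge(e: EvalAttempts): Badge | undefined {
  const { passedRuns, totalRuns } = e
  if (passedRuns === undefined || totalRuns === undefined || totalRuns <= 1) {
    return undefined
  }
  return {
    label: `${passedRuns}/${totalRuns}`,
    tooltip: `${passedRuns} of ${totalRuns} attempts passed`
  }
}
```

Returning `undefined` for single-run evals and for missing fields keeps the rendering path identical for old and new data files.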

All new fields are optional: the page still renders against an older agent-results.json (the First-Try Rate column shows —, and the tiebreak falls back to recency).
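The ordering described above (success rate, then First-Try Rate, then recency, with missing passAt1 sorting last within a tie) can be sketched as a comparator. The `ModelRow` shape, the `timestamp` recency key, and the `compareRows` name are assumptions made for illustration; only `passAt1` is a field name taken from the PR.

```typescript
// Hypothetical row shape for illustration.
interface ModelRow {
  model: string
  successRate: number // fraction in [0, 1]
  passAt1?: number    // fraction in [0, 1]; absent in older data files
  timestamp: number   // assumed recency key, larger means more recent
}

function compareRows(a: ModelRow, b: ModelRow): number {
  // 1. Higher success rate first.
  if (a.successRate !== b.successRate) return b.successRate - a.successRate

  // 2. Tiebreak on passAt1; rows without it sort below rows that have it.
  const pa = a.passAt1
  const pb = b.passAt1
  if (pa !== undefined && pb !== undefined) {
    if (pa !== pb) return pb - pa
  } else if (pa !== undefined) {
    return -1
  } else if (pb !== undefined) {
    return 1
  }

  // 3. Most recent first. With an old data file, this is the only tiebreak.
  return b.timestamp - a.timestamp
}
```

When no row carries `passAt1`, step 2 never decides anything, so the result matches a plain success-rate-then-recency sort.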

Notes

avgPassRate is intentionally not displayed: under the harness's early-exit retry loop it's a censored/biased estimator, so only passAt1 (an uncensored first-run proportion) is surfaced. A follow-up on the nuxt-evals side to drop early-exit would make the retry stats fully unbiased; once that lands, the "failed twice before passing" wording in the methodology note should be updated to the regime-agnostic phrasing already used by the badge tooltip.
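To illustrate why a per-eval passedRuns/totalRuns ratio is biased under early exit, the sketch below computes its expected value for a single eval. It rests on assumptions not stated in the PR: the harness stops at the first passing attempt, attempts are independent with the same pass probability `p`, and avgPassRate averages this per-eval ratio. The function name is hypothetical.

```typescript
// Expected value of passedRuns / totalRuns for one eval, assuming the retry
// loop stops at the first pass and each attempt passes independently with
// probability p, for at most maxAttempts attempts.
function expectedEarlyExitPassRate(p: number, maxAttempts: number): number {
  let expected = 0
  let allFailedSoFar = 1 // probability that attempts 1..k-1 all failed
  for (let k = 1; k <= maxAttempts; k++) {
    // The first pass occurs on attempt k: 1 pass out of k runs.
    expected += allFailedSoFar * p * (1 / k)
    allFailedSoFar *= 1 - p
  }
  // If every attempt fails the ratio is 0, which adds nothing to the sum.
  return expected
}

// With p = 0.5 and up to 4 attempts the expected ratio is about 0.682,
// well above the true per-attempt probability of 0.5.
const biased = expectedEarlyExitPassRate(0.5, 4)
```

Under the same assumptions the first attempt is always run, so the fraction of evals that pass on the first attempt estimates `p` without this distortion, which is consistent with surfacing only passAt1.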

Verify

- 100%-success models are now ordered by First-Try Rate (Fable 5 & Cursor Composer 2.0 at 97% above Opus 4.8 / Gemini 3 / GPT 5.3 at 90%)
- Renders against the old data file with — and unchanged sorting
- Extra column follows the table's existing horizontal-scroll pattern on mobile
- Lint clean

Refresh agent-results.json with the new reliability metrics and surface first-attempt reliability on the leaderboard:

- New "First-Try Rate" column (passAt1), used as the tiebreak between models with the same success rate
- Expanded rows show a passedRuns/totalRuns badge with a tooltip when an eval needed retries
- Methodology note under the table explaining how the metrics are computed

All new JSON fields are optional, so the page still renders against an older agent-results.json.

vercel · 2026-07-03T16:23:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
nuxt	Ready	Preview, Comment	Jul 3, 2026 5:05pm

github-actions · 2026-07-03T16:23:36Z

We've flagged this as a potential contribution without a human behind it. We welcome the thoughtful use of AI tools when contributing, but ask all contributors to follow two core principles:

Never let an LLM speak for you - all comments, issues, and PR descriptions should be written in your own words, reflecting your own understanding.
Never let an LLM think for you - only submit contributions you fully understand and can explain.

Please review these AI-assisted contribution guidelines and update this contribution if needed.

If this was flagged in error, we apologise! 😳 Just let us know. 🙏

coderabbitai · 2026-07-03T16:30:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 698674dd-4fa0-43d8-aa9d-cd6d983e6818

📥 Commits

Reviewing files that changed from the base of the PR and between 0c0bf82 and d2206e9.

📒 Files selected for processing (1)

app/pages/evals.vue

🚧 Files skipped from review as they are similar to previous changes (1)

app/pages/evals.vue

📝 Walkthrough

This change adds first-try and attempt metrics to the evaluations results page. It extends the eval, experiment, and model row data shapes with optional first-run and aggregate fields, derives and sorts by passAt1, adds a First-Try Rate column, and updates the expanded Result cell to show passedRuns/totalRuns when multiple runs exist. It also adds a minimax icon mapping and explanatory copy for the displayed metrics.

Estimated code review effort: 2 (Simple) | ~12 minutes

Sequence Diagram(s)

Not applicable — no direct request/response or async interaction flow was introduced.

Related Issues: None provided.

Related PRs: None provided.

Suggested labels: enhancement, frontend

Suggested reviewers: None provided.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main change: adding a First-Try Rate column to evals.
Description check	✅ Passed	The description accurately describes the evals leaderboard, sorting, tooltips, and data refresh changes.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/pages/evals.vue`:
- Around line 83-90: The sort tiebreak in sortRows is using passAt1 from the
original experiment row, which can be stale after category filtering. Update the
category-filtered row construction in evals.vue so passAt1 is recomputed from
the filtered subset whenever successRate is recalculated, and ensure sortRows
continues to use the row’s updated passAt1 value for tie-breaking.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c204fd6c-ba32-4f6a-a1a1-e537eed47388

📥 Commits

Reviewing files that changed from the base of the PR and between 1a39262 and 0c0bf82.

📒 Files selected for processing (2)

app/pages/evals.vue
public/agent-results.json

When a category filter is applied, passAt1 was left at the full eval-set value while successRate was recomputed from the subset, making both the displayed column and the sort tiebreak stale. Recompute passAt1 from the filtered evals via firstRunSuccess; keep undefined for older data without the field so it still renders as "—".
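The recomputation described in this commit message might look roughly like the sketch below. It is not the code from evals.vue: `firstRunSuccess` is the field name from the PR, whereas the `EvalResult` shape, the `category` and `success` fields, and the `filteredRates` helper are assumptions. The sketch also assumes that passAt1 stays undefined if any eval in the subset lacks `firstRunSuccess`.

```typescript
// Hypothetical per-eval record for illustration.
interface EvalResult {
  category: string          // assumed field name
  success: boolean          // assumed: passed on at least one attempt
  firstRunSuccess?: boolean // from the PR; absent in older data files
}

interface FilteredRates {
  successRate: number
  passAt1: number | undefined
}

// Recompute both rates from the same filtered subset so they stay in sync.
function filteredRates(evals: EvalResult[], category: string): FilteredRates {
  const subset = evals.filter(e => e.category === category)
  if (subset.length === 0) return { successRate: 0, passAt1: undefined }

  const successRate = subset.filter(e => e.success).length / subset.length

  // Keep passAt1 undefined for older data so the column still renders "—".
  const hasFirstRun = subset.every(e => e.firstRunSuccess !== undefined)
  const passAt1 = hasFirstRun
    ? subset.filter(e => e.firstRunSuccess === true).length / subset.length
    : undefined

  return { successRate, passAt1 }
}
```

Deriving both values from one `subset` avoids the stale-tiebreak problem flagged in the review, since the sort then reads a passAt1 that matches the displayed success rate.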

benjamincanac requested a review from atinux as a code owner July 3, 2026 16:23

github-actions Bot added the agentscan:mixed-signals and possible bot labels Jul 3, 2026

vercel Bot deployed to Preview July 3, 2026 16:23 View deployment

coderabbitai Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread app/pages/evals.vue

vercel Bot deployed to Preview July 3, 2026 16:45 View deployment

benjamincanac wants to merge 2 commits into main from feat/evals-first-try-rate
