feat(evals): add First-Try Rate reliability column #2308
Refresh agent-results.json with the new reliability metrics and surface first-attempt reliability on the leaderboard:
- New "First-Try Rate" column (passAt1), used as the tiebreak between models with the same success rate
- Expanded rows show a passedRuns/totalRuns badge with a tooltip when an eval needed retries
- Methodology note under the table explaining how the metrics are computed

All new JSON fields are optional, so the page still renders against an older agent-results.json.
We've flagged this as a potential contribution without a human behind it. We welcome the thoughtful use of AI tools when contributing, but ask all contributors to follow two core principles:
Please review these AI-assisted contribution guidelines and update this contribution if needed. If this was flagged in error, we apologise! 😳 Just let us know. 🙏
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
This change adds first-try and attempt metrics to the evaluations results page. It extends the eval, experiment, and model row data shapes with optional first-run and aggregate fields, derives and sorts by …
Estimated code review effort: 2 (Simple) | ~12 minutes
Sequence Diagram(s): Not applicable; no direct request/response or async interaction flow was introduced.
Related Issues: None provided.
Related PRs: None provided.
Suggested labels: enhancement, frontend
Suggested reviewers: None provided.
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/pages/evals.vue`:
- Around line 83-90: The sort tiebreak in sortRows is using passAt1 from the
original experiment row, which can be stale after category filtering. Update the
category-filtered row construction in evals.vue so passAt1 is recomputed from
the filtered subset whenever successRate is recalculated, and ensure sortRows
continues to use the row’s updated passAt1 value for tie-breaking.
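The tie-break described in this comment can be sketched roughly as follows. This is an illustrative comparator, not the actual `sortRows` in `app/pages/evals.vue`: the `ModelRow` shape, the `timestamp` recency key, and the name `sortRowsSketch` are assumptions made for the example. Only `successRate` and `passAt1` come from the PR.

```typescript
// Hypothetical row shape. Only successRate and passAt1 are named in the PR;
// `model` and `timestamp` are assumptions for this sketch.
interface ModelRow {
  model: string;
  successRate: number; // fraction in [0, 1]
  passAt1?: number;    // fraction in [0, 1]; absent in older agent-results.json
  timestamp: number;   // assumed recency key (larger = more recent)
}

function compareRows(a: ModelRow, b: ModelRow): number {
  // Primary key: higher success rate first.
  if (a.successRate !== b.successRate) return b.successRate - a.successRate;

  // Tie-break: rows without passAt1 sort below rows that have it.
  const ap = a.passAt1;
  const bp = b.passAt1;
  if (ap === undefined && bp !== undefined) return 1;
  if (ap !== undefined && bp === undefined) return -1;

  // Both present: higher first-try rate first.
  if (ap !== undefined && bp !== undefined && ap !== bp) return bp - ap;

  // Final fallback: most recent first.
  return b.timestamp - a.timestamp;
}

// Returns a sorted copy so the input array is left untouched.
function sortRowsSketch(rows: ModelRow[]): ModelRow[] {
  return [...rows].sort(compareRows);
}
```

Because the comparator reads `passAt1` from whatever row it is given, the value has to be recomputed on the filtered row before sorting; otherwise the tie-break reflects the full eval set.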
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c204fd6c-ba32-4f6a-a1a1-e537eed47388
📒 Files selected for processing (2)
app/pages/evals.vue
public/agent-results.json
When a category filter is applied, passAt1 was left at the full eval-set value while successRate was recomputed from the subset, making both the displayed column and the sort tiebreak stale. Recompute passAt1 from the filtered evals via firstRunSuccess; keep undefined for older data without the field so it still renders as "—".
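A minimal sketch of the recomputation this commit describes, assuming a simplified eval shape. The `category` and `passed` field names and the `summarizeCategory` helper are hypothetical; only `firstRunSuccess`, `passAt1`, and `successRate` appear in the PR. The rule that `passAt1` stays `undefined` unless every eval in the subset carries `firstRunSuccess` is also an assumption about how partial data would be handled.

```typescript
// Hypothetical eval shape; `category` and `passed` are assumed field names.
interface EvalResult {
  category: string;
  passed: boolean;
  firstRunSuccess?: boolean; // optional: missing in older agent-results.json
}

interface CategorySummary {
  successRate: number;
  passAt1: number | undefined; // undefined renders as "—" in the table
}

function summarizeCategory(evals: EvalResult[], category: string): CategorySummary {
  const subset = evals.filter((e) => e.category === category);
  if (subset.length === 0) return { successRate: 0, passAt1: undefined };

  // Both metrics are derived from the same filtered subset.
  const successRate = subset.filter((e) => e.passed).length / subset.length;

  // Assumption: only report passAt1 when every eval has first-run data.
  const hasFirstRunData = subset.every((e) => e.firstRunSuccess !== undefined);
  const passAt1 = hasFirstRunData
    ? subset.filter((e) => e.firstRunSuccess === true).length / subset.length
    : undefined;

  return { successRate, passAt1 };
}
```

Deriving both numbers in one place keeps the displayed column and the sort tiebreak consistent with the active filter.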
Overview
Adds a First-Try Rate column to the /evals leaderboard and refreshes `public/agent-results.json` with the new reliability metrics exported by nuxt-evals.

Several models tie at 100% Success (an eval counts as passed if any of up to 4 attempts passes), so this new metric breaks the tie by measuring first-attempt reliability.

Changes
- The First-Try Rate column shows `passAt1` as a percentage (muted, right-aligned) with an info tooltip. Shows `—` when the field is absent.
- A model without `passAt1` sorts below models that have it, within the same success rate.
- When an eval needed retries (`totalRuns > 1`), a `passedRuns/totalRuns` badge (e.g. `1/3`) is shown with a "X of Y attempts passed" tooltip.
- Refreshes `agent-results.json` with the additive reliability fields (`passAt1`, `firstRunSuccess`, `passedRuns`, `totalRuns`, `passRate`).

All new fields are optional: the page still renders against an older `agent-results.json` (Pass column shows `—`, sorting falls back to recency).

Notes
`avgPassRate` is intentionally not displayed: under the harness's early-exit retry loop it's a censored/biased estimator, so only `passAt1` (an uncensored first-run proportion) is surfaced. A follow-up on the nuxt-evals side to drop early-exit would make the retry stats fully unbiased; once that lands, the "failed twice before passing" wording in the methodology note should be updated to the regime-agnostic phrasing already used by the badge tooltip.

Verify
- … `—` and unchanged sorting
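To illustrate the note on `avgPassRate`, here is a small worked calculation under stated assumptions: each attempt passes independently with the same probability `p`, the retry loop stops at the first pass, and the per-eval pass rate is `passedRuns / totalRuns`. The harness's exact behaviour is not shown in this PR, so the function names and the model below are hypothetical.

```typescript
// Assumed cap on attempts, taken from "up to 4 attempts" in the overview.
const MAX_ATTEMPTS = 4;

// Expected value of passedRuns / totalRuns when the loop exits on the first pass.
// If the first pass happens on attempt k, the observed ratio is 1 / k.
// If all attempts fail, the ratio is 0 and contributes nothing to the sum.
function expectedEarlyExitPassRate(p: number, maxAttempts: number = MAX_ATTEMPTS): number {
  let expected = 0;
  for (let k = 1; k <= maxAttempts; k++) {
    const probFirstPassAtK = Math.pow(1 - p, k - 1) * p;
    expected += probFirstPassAtK / k;
  }
  return expected;
}

// The first attempt is always run, so its expected pass proportion is p itself.
function expectedPassAt1(p: number): number {
  return p;
}

// For p = 0.5: 0.5/1 + 0.25/2 + 0.125/3 + 0.0625/4 ≈ 0.682, versus 0.5 for passAt1.
const earlyExitAtHalf = expectedEarlyExitPassRate(0.5);
const passAt1AtHalf = expectedPassAt1(0.5);
```

Under this model the early-exit ratio overstates the underlying pass probability for any `p` strictly between 0 and 1, while the first-run proportion does not, which matches the reasoning for surfacing only `passAt1`.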