feat(skill): add eval framework to measure SKILL.md effectiveness #602
Conversation
Semver Impact of This PR: 🟡 Minor (new features)

📋 Changelog Preview — this is how your changes will appear in the changelog:
- New Features ✨
- Bug Fixes 🐛
- Upgrade
- Other
- Documentation 📚
- Internal Changes 🔧

🤖 This preview updates automatically when you update the PR.
Codecov Results 📊

✅ 129 passed | Total: 129 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch
✨ No test changes detected. All tests are passing successfully.
❌ Patch coverage is 66.67%. The project has 1303 uncovered lines. Files with missing lines: 1.

Coverage diff:

    @@            Coverage Diff             @@
    ##              main      #PR    +/-   ##
    ==========================================
    - Coverage    95.73%   95.62%   -0.11%
    ==========================================
      Files          204      204        —
      Lines        29877    29739     -138
      Branches         0        0        —
    ==========================================
    + Hits         28601    28436     -165
    - Misses        1276     1303      +27
    - Partials         0        0        —

Generated by Codecov Action
Addressed Cursor Bugbot feedback:
betegon left a comment:

Couple of comments, and maybe we should consider using something like https://github.com/getsentry/vitest-evals (although we're on bun test).
    - uses: actions/cache@v5
      id: cache
      with:
        path: node_modules
        key: node-modules-${{ hashFiles('bun.lock', 'patches/**') }}
    - if: steps.cache.outputs.cache-hit != 'true'
Bug: The eval-skill-fork.yml workflow uses pull_request_target and checks out untrusted fork code, which is then executed with access to repository secrets, allowing for potential secret exfiltration.
Severity: CRITICAL
Suggested Fix
To prevent untrusted code execution with access to secrets, change the workflow trigger from pull_request_target to pull_request. This ensures the workflow runs in the context of the fork without access to secrets. If secrets are necessary, refactor the workflow to run only trusted code from the base repository, and avoid checking out the pull request's head commit (github.event.pull_request.head.sha).
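A minimal sketch of the safer shape (illustrative only: the trigger change and checkout step follow the suggestion above, while the job body is an assumption, not the PR's actual workflow):

```yaml
# Runs in the fork's own context: the token is read-only and repository
# secrets are not exposed, so executing fork code cannot exfiltrate them.
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      # Default checkout of the merge ref; no head.sha override needed.
      - uses: actions/checkout@v4
      - run: bun test
```

If the eval genuinely needs `ANTHROPIC_API_KEY`, the alternative is to keep `pull_request_target` but run only scripts from the base branch, never anything checked out from the fork.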
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: .github/workflows/eval-skill-fork.yml#L40-L46
Potential issue: The GitHub Actions workflow `eval-skill-fork.yml` is triggered by `pull_request_target`, which grants it access to repository secrets. The workflow then checks out untrusted code from a fork pull request using `actions/checkout` with the PR's `head.sha`. This untrusted code is subsequently executed in a step that has access to secrets like `ANTHROPIC_API_KEY` and `SENTRY_RELEASE_BOT_PRIVATE_KEY`. A malicious actor could modify the code in their fork PR to exfiltrate these secrets, as the executed scripts have direct access to environment variables.
Force-pushed from d409adf to 353155f (Compare).
Two-phase eval: sends test prompts to an LLM with SKILL.md as context, then grades the planned commands on efficiency criteria (no pre-auth, no org lookup, correct fields, minimal calls, trusts auto-detection).

- 8 test cases covering the failure modes from issue #598
- Deterministic checks (string matching) + LLM judge (coherence)
- Uses Anthropic API (claude-sonnet-4-6, claude-opus-4-6) via repo secret
- CI job runs on skill-related file changes, fails below 75% threshold
- Fork PRs: blocked until maintainer adds eval-skill label; eval runs via pull_request_target, results posted as commit status
- Label removed on synchronize (new push forces re-review)
- Uses SENTRY_RELEASE_BOT app token to re-trigger main CI after fork eval
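The deterministic half of the grading above can be sketched as follows. This is a sketch only: the check names and match strings are assumptions for illustration, not the PR's actual harness; the 75% threshold matches the description.

```typescript
// Deterministic grading sketch: each check is a simple string match against
// the agent's planned commands; the suite fails below the 75% pass rate.
type Check = { name: string; pass: (plan: string) => boolean };

// Hypothetical checks mirroring the criteria listed above (no pre-auth,
// no org lookup); the real eval defines its own.
const checks: Check[] = [
  { name: "no pre-auth", pass: (p) => !p.includes("auth login") },
  { name: "no org lookup", pass: (p) => !p.includes("org list") },
];

export function passRate(plan: string, cs: Check[] = checks): number {
  const passed = cs.filter((c) => c.pass(plan)).length;
  return passed / cs.length;
}

export const THRESHOLD = 0.75; // CI fails below this pass rate
```

An LLM judge would then score the surviving plans for coherence, and the two scores combine into the final pass/fail against the threshold.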
      }
      const n = Number(input);
    - if (Number.isNaN(n) || n < 0) {
    + if (Number.isNaN(n)) {
Negative span depth now accepted instead of defaulting
Medium Severity
The `n < 0` guard was removed from `parseSpanDepth`, so negative inputs like "-1" now return the negative number instead of falling back to `DEFAULT_SPAN_DEPTH` (3). The docstring explicitly states "Invalid values fall back to default depth (3)," and negative values were previously treated as invalid. With the guard removed, a negative depth fails the `spans > 0` check, effectively disabling span trees: a silent behavior change from showing 3 levels of depth to showing none.
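A sketch of the restored guard, assuming the documented fallback behavior (the function and constant names follow the review comment above):

```typescript
const DEFAULT_SPAN_DEPTH = 3;

// Invalid values (non-numeric or negative) fall back to the default depth,
// matching the docstring quoted in the review comment.
export function parseSpanDepth(input: string): number {
  const n = Number(input);
  if (Number.isNaN(n) || n < 0) {
    return DEFAULT_SPAN_DEPTH;
  }
  return n;
}
```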
    for (const section of changelog.sections) {
      // Skip sections whose markdown is empty after whitespace trimming
      if (!section.markdown.trim()) {
        continue;
Changelog whitespace section filtering was removed
Low Severity
The old formatChangelog skipped sections whose markdown was whitespace-only (via .trim() check) and returned empty string if all sections were empty after filtering. The new code unconditionally pushes every section's markdown, including whitespace-only ones. This produces changelog output with category headings followed by blank content, and no longer returns empty string when all sections have only whitespace.
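The old behavior described above can be sketched like this (the `Section` shape and heading format are assumptions; only the trim-and-skip logic comes from the review comment):

```typescript
type Section = { heading: string; markdown: string };

// Skip whitespace-only sections so no heading is emitted with blank content;
// join() on an empty array returns "" when every section is blank.
export function formatChangelog(sections: Section[]): string {
  const parts: string[] = [];
  for (const section of sections) {
    if (!section.markdown.trim()) {
      continue;
    }
    parts.push(`## ${section.heading}\n${section.markdown}`);
  }
  return parts.join("\n\n");
}
```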
Force-pushed from 353155f to 6644186 (Compare).
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
    }

    let stack: string[];
    try {
Removed try-catch around JSON.parse for DB data
Low Severity
The try-catch around JSON.parse(row.cursor_stack) was removed. Previously, corrupted or malformed JSON in the database would be gracefully handled by deleting the bad row and returning undefined. Now it throws an unhandled exception, which could crash pagination for any list command if the stored cursor_stack value is invalid.
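A sketch of the defensive parse the comment describes (the `deleteRow` callback stands in for the row-deletion side effect; the function name is an assumption):

```typescript
// Malformed cursor_stack JSON is treated as a missing cursor: delete the
// corrupted row and return undefined instead of letting JSON.parse throw
// and crash pagination for list commands.
export function parseCursorStack(
  raw: string,
  deleteRow: () => void
): string[] | undefined {
  let stack: string[];
  try {
    stack = JSON.parse(raw);
  } catch {
    deleteRow();
    return undefined;
  }
  return stack;
}
```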
    }
    if (!WIDGET_TYPES.includes(dataset as (typeof WIDGET_TYPES)[number])) {
      throw new ValidationError(
        `Invalid --dataset value "${dataset}".\nValid datasets: ${WIDGET_TYPES.join(", ")}`,
        "dataset"
      );
    }
    }
Bug: The validation for deprecated dashboard widget datasets like "discover" was removed. Users will no longer be warned to migrate to supported alternatives like "error-events" or "spans".
Severity: MEDIUM
Suggested Fix
Reintroduce the validation logic that checks for and rejects deprecated dataset types. This can be done by checking against a list of deprecated datasets before the main WIDGET_TYPES inclusion check, and throwing a ValidationError with a helpful migration message for each deprecated type.
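One possible shape for the reintroduced guard. The deprecated-to-replacement mapping is an assumption based on the alternatives named above ("error-events", "spans"); `WIDGET_TYPES` and `ValidationError` mirror the snippet quoted earlier in the thread.

```typescript
// Deprecated types remain in WIDGET_TYPES (so old dashboards still load),
// which is why they must be rejected explicitly before the inclusion check.
const WIDGET_TYPES = ["error-events", "spans", "discover", "transaction-like"] as const;

// Hypothetical migration hints, per the review comment's examples.
const DEPRECATED: Record<string, string> = {
  discover: "error-events or spans",
  "transaction-like": "spans",
};

export class ValidationError extends Error {
  constructor(message: string, readonly field: string) {
    super(message);
  }
}

export function validateDataset(dataset: string): void {
  if (dataset in DEPRECATED) {
    throw new ValidationError(
      `Dataset "${dataset}" is deprecated. Migrate to: ${DEPRECATED[dataset]}.`,
      "dataset"
    );
  }
  if (!WIDGET_TYPES.includes(dataset as (typeof WIDGET_TYPES)[number])) {
    throw new ValidationError(
      `Invalid --dataset value "${dataset}".\nValid datasets: ${WIDGET_TYPES.join(", ")}`,
      "dataset"
    );
  }
}
```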
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: src/commands/dashboard/resolve.ts#L496-L503
Potential issue: The refactored `validateWidgetEnums` function in `src/commands/dashboard/resolve.ts` removed logic that explicitly rejected deprecated dataset types. Previously, using datasets like `"discover"` or `"transaction-like"` would trigger a helpful error message guiding the user to migrate to newer alternatives. The new validation logic only checks if the dataset is included in the `WIDGET_TYPES` array. Since the deprecated types are still present in this array, they now pass validation silently. This is a regression that allows users to create dashboard widgets with deprecated datasets without receiving any warning or migration guidance.
Summary
Adds an evaluation framework that measures how effectively SKILL.md guides an LLM agent to use the Sentry CLI efficiently. Inspired by the skill-creator plugin approach of prompt → plan → grade.
- claude-sonnet-4-6 + claude-opus-4-6 as agents, claude-haiku-4-5 as judge
- skill-eval environment (requires reviewer approval to use the API key)

Running locally
With an Anthropic API key:
Test a single model:
Ref #598