
feat(skill): add eval framework to measure SKILL.md effectiveness#602

Open
BYK wants to merge 1 commit into main from feat/skill-eval-framework

Conversation

Member

@BYK BYK commented Mar 30, 2026

Summary

Adds an evaluation framework that measures how effectively SKILL.md guides an LLM agent to use the Sentry CLI efficiently. Inspired by the skill-creator plugin approach of prompt → plan → grade.

  • Two-phase eval: sends SKILL.md + user prompt to an LLM, then grades the planned commands with deterministic checks (string matching) and an LLM judge (coherence)
  • 8 test cases covering the failure modes from "Improve skill: Avoid auth on every request, and improve knowledge about auto project detection" (#598): no pre-auth, no org/project lookup, correct fields, minimal calls, trusts auto-detection
  • Anthropic API with claude-sonnet-4-6 + claude-opus-4-6 as agents, claude-haiku-4-5 as judge
  • CI job runs on skill-related file changes, protected by the skill-eval environment (requires reviewer approval to use the API key)
  • Blocking — added to CI Status, fails below 75% threshold
  • Baseline: 8/8 cases passed (100%) on both models
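
As a rough illustration of the threshold gate described above, here is a minimal sketch. It assumes the 75% threshold applies to the fraction of passing cases in a run; the `CaseResult` type and the `passRate` and `shouldFail` helpers are hypothetical names, not taken from the PR.

```typescript
// Hypothetical result shape; the PR does not show the real types.
type CaseResult = { name: string; passed: boolean };

// The 75% threshold mentioned in the summary (assumed to be a pass-rate floor).
const PASS_THRESHOLD = 0.75;

/** Fraction of cases that passed (0 when there are no cases). */
function passRate(results: CaseResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.passed).length;
  return passed / results.length;
}

/** True when the run should fail CI, i.e. the pass rate is below the threshold. */
function shouldFail(results: CaseResult[], threshold = PASS_THRESHOLD): boolean {
  return passRate(results) < threshold;
}

/** Builds a run of `total` cases where the first `passing` cases pass. */
function makeRun(total: number, passing: number): CaseResult[] {
  return Array.from({ length: total }, (_, i) => ({
    name: `case-${i + 1}`,
    passed: i < passing,
  }));
}

// The reported baseline (8/8) clears the gate; 5/8 (62.5%) would not.
console.log(shouldFail(makeRun(8, 8))); // false
console.log(shouldFail(makeRun(8, 5))); // true
```

With this reading, exactly 6 of 8 passing (75%) still passes, because the gate only fails runs strictly below the threshold.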

Running locally

With an Anthropic API key:

ANTHROPIC_API_KEY=sk-ant-... bun run eval:skill

Test a single model:

EVAL_AGENT_MODELS=claude-sonnet-4-6 ANTHROPIC_API_KEY=... bun run eval:skill

Ref #598

Contributor

github-actions bot commented Mar 30, 2026

Semver Impact of This PR

🟡 Minor (new features)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


New Features ✨

  • (skill) Add eval framework to measure SKILL.md effectiveness by BYK in #602
  • (telemetry) Add seer.outcome span tag for Seer command metrics by BYK in #609
  • (upgrade) Show changelog summary during CLI upgrade by BYK in #594

Bug Fixes 🐛

Upgrade

  • Prevent spinner freeze during delta patch application by BYK in #608
  • Indent changelog, add emoji to heading, hide empty sections by BYK in #604

Other

  • (dashboard) Reject MRI queries with actionable tracemetrics guidance by BYK in #601
  • (skill) Avoid unnecessary auth, reinforce auto-detection, fix field examples by BYK in #599
  • 3 bug fixes: subcommand crash, negative span depth, pagination JSON parse by cursor in #607

Documentation 📚

  • (skill) Document dashboard widget constraints and deprecated datasets by BYK in #605
  • Fix documentation gaps and embed skill files at build time by cursor in #606

Internal Changes 🔧

  • Regenerate skill files and command docs by github-actions[bot] in 664362ca

🤖 This preview updates automatically when you update the PR.

@BYK BYK marked this pull request as ready for review March 30, 2026 11:13
@BYK BYK requested review from MathurAditya724 and betegon March 30, 2026 11:14
Contributor

github-actions bot commented Mar 30, 2026

Codecov Results 📊

129 passed | Total: 129 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric Change
Total Tests
Passed Tests
Failed Tests
Skipped Tests

✨ No test changes detected

All tests are passing successfully.

❌ Patch coverage is 66.67%. Project has 1303 uncovered lines.
❌ Project coverage is 95.62%. Comparing base (base) to head (head).

Files with missing lines (1)
File Patch % Lines
src/lib/formatters/human.ts 6.25% ⚠️ 15 Missing
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.73%    95.62%    -0.11%
==========================================
  Files          204       204         —
  Lines        29877     29739      -138
  Branches         0         0         —
==========================================
+ Hits         28601     28436      -165
- Misses        1276      1303       +27
- Partials         0         0         —

Generated by Codecov Action

Member Author

BYK commented Mar 30, 2026

Addressed Cursor Bugbot feedback: expected-patterns check in judge.ts now uses allCommands.some(cmd => cmd.includes(...)) per-command instead of matching against the concatenated joined string. This prevents false positives from patterns matching across command boundaries, consistent with how anti-patterns already works.
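
A minimal sketch of the difference, for anyone skimming the thread. Only the `allCommands.some(cmd => cmd.includes(...))` shape comes from the comment above; the helper names, the space separator used for joining, and the sample commands are illustrative assumptions rather than the code in judge.ts.

```typescript
/** Old behavior (assumed): search the concatenated string of all commands. */
function matchesJoined(allCommands: string[], pattern: string): boolean {
  // A pattern can straddle the boundary between two commands here.
  return allCommands.join(" ").includes(pattern);
}

/** New behavior: the pattern must occur inside a single command. */
function matchesPerCommand(allCommands: string[], pattern: string): boolean {
  return allCommands.some((cmd) => cmd.includes(pattern));
}

// Made-up commands, chosen only to show a cross-boundary match.
const allCommands = ["alpha --flag", "beta --json"];

// "--flag beta" never appears within one command, only across the join.
console.log(matchesJoined(allCommands, "--flag beta")); // true (false positive)
console.log(matchesPerCommand(allCommands, "--flag beta")); // false

// A pattern contained in a single command matches either way.
console.log(matchesPerCommand(allCommands, "beta --json")); // true
```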

Member

@betegon betegon left a comment


A couple of comments, and maybe we should consider using something like https://github.com/getsentry/vitest-evals (although we're on bun test).

Comment on lines +40 to +46

- uses: actions/cache@v5
  id: cache
  with:
    path: node_modules
    key: node-modules-${{ hashFiles('bun.lock', 'patches/**') }}
- if: steps.cache.outputs.cache-hit != 'true'

Bug: The eval-skill-fork.yml workflow uses pull_request_target and checks out untrusted fork code, which is then executed with access to repository secrets, allowing for potential secret exfiltration.
Severity: CRITICAL

Suggested Fix

To prevent untrusted code execution with access to secrets, change the workflow trigger from pull_request_target to pull_request. This ensures the workflow runs in the context of the fork without access to secrets. If secrets are necessary, refactor the workflow to run only trusted code from the base repository, and avoid checking out the pull request's head commit (github.event.pull_request.head.sha).

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: .github/workflows/eval-skill-fork.yml#L40-L46

Potential issue: The GitHub Actions workflow `eval-skill-fork.yml` is triggered by
`pull_request_target`, which grants it access to repository secrets. The workflow then
checks out untrusted code from a fork pull request using `actions/checkout` with the
PR's `head.sha`. This untrusted code is subsequently executed in a step that has access
to secrets like `ANTHROPIC_API_KEY` and `SENTRY_RELEASE_BOT_PRIVATE_KEY`. A malicious
actor could modify the code in their fork PR to exfiltrate these secrets, as the
executed scripts have direct access to environment variables.

@BYK BYK closed this Mar 31, 2026
@BYK BYK reopened this Mar 31, 2026
@BYK BYK force-pushed the feat/skill-eval-framework branch from d409adf to 353155f Compare March 31, 2026 10:50
Two-phase eval: sends test prompts to an LLM with SKILL.md as context,
then grades the planned commands on efficiency criteria (no pre-auth,
no org lookup, correct fields, minimal calls, trusts auto-detection).

- 8 test cases covering the failure modes from issue #598
- Deterministic checks (string matching) + LLM judge (coherence)
- Uses Anthropic API (claude-sonnet-4-6, claude-opus-4-6) via repo secret
- CI job runs on skill-related file changes, fails below 75% threshold
- Fork PRs: blocked until maintainer adds eval-skill label, eval runs
  via pull_request_target, results posted as commit status
- Label removed on synchronize (new push forces re-review)
- Uses SENTRY_RELEASE_BOT app token to re-trigger main CI after fork eval
  }
  const n = Number(input);
- if (Number.isNaN(n) || n < 0) {
+ if (Number.isNaN(n)) {

Negative span depth now accepted instead of defaulting

Medium Severity

The n < 0 guard was removed from parseSpanDepth, so negative inputs like "-1" now return the negative number instead of falling back to DEFAULT_SPAN_DEPTH (3). The docstring explicitly states "Invalid values fall back to default depth (3)," and previously negative values were treated as invalid. With the guard removed, negative depths fail the spans > 0 check, which effectively disables span trees: a silent behavior change from showing 3 levels of depth to showing none.
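
A minimal sketch of the guarded behavior the comment describes, assuming a simplified signature. The real parseSpanDepth may accept other inputs or return types; only the `Number(input)` conversion, the two-part guard, and the default of 3 come from the quoted diff and this comment.

```typescript
// Default named in the review comment.
const DEFAULT_SPAN_DEPTH = 3;

// Simplified, hypothetical signature for illustration only.
function parseSpanDepth(input: string): number {
  const n = Number(input);
  // Treat both non-numeric and negative values as invalid,
  // as the guard did before it was removed.
  if (Number.isNaN(n) || n < 0) {
    return DEFAULT_SPAN_DEPTH;
  }
  return n;
}

console.log(parseSpanDepth("5")); // 5
console.log(parseSpanDepth("-1")); // 3
console.log(parseSpanDepth("abc")); // 3
```

Note that in this sketch "0" is still returned as 0, so an explicit zero remains a way to turn span trees off on purpose.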


for (const section of changelog.sections) {
  // Skip sections whose markdown is empty after whitespace trimming
  if (!section.markdown.trim()) {
    continue;

Changelog whitespace section filtering was removed

Low Severity

The old formatChangelog skipped sections whose markdown was whitespace-only (via .trim() check) and returned empty string if all sections were empty after filtering. The new code unconditionally pushes every section's markdown, including whitespace-only ones. This produces changelog output with category headings followed by blank content, and no longer returns empty string when all sections have only whitespace.
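
A minimal sketch of the filtering behavior described above. Only `changelog.sections` and `section.markdown` appear in the quoted diff; the `Changelog` and `ChangelogSection` types, the blank-line separator, and the omission of category headings are simplifying assumptions.

```typescript
// Hypothetical types, reduced to the two fields visible in the diff.
type ChangelogSection = { markdown: string };
type Changelog = { sections: ChangelogSection[] };

function formatChangelog(changelog: Changelog): string {
  const parts: string[] = [];
  for (const section of changelog.sections) {
    // Skip sections whose markdown is empty after whitespace trimming.
    if (!section.markdown.trim()) {
      continue;
    }
    parts.push(section.markdown);
  }
  // With no surviving sections, joining an empty array yields "".
  return parts.join("\n\n");
}

const sample: Changelog = {
  sections: [{ markdown: "- Fix spinner freeze" }, { markdown: "   \n" }],
};
console.log(JSON.stringify(formatChangelog(sample))); // "- Fix spinner freeze"
```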


@BYK BYK force-pushed the feat/skill-eval-framework branch from 353155f to 6644186 Compare March 31, 2026 10:52
@github-actions
Contributor

PR Preview Action v1.8.1


🚀 View preview at
https://cli.sentry.dev/pr-preview/pr-602/

Built to branch gh-pages at 2026-03-31 10:52 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


}

let stack: string[];
try {

Removed try-catch around JSON.parse for DB data

Low Severity

The try-catch around JSON.parse(row.cursor_stack) was removed. Previously, corrupted or malformed JSON in the database would be gracefully handled by deleting the bad row and returning undefined. Now it throws an unhandled exception, which could crash pagination for any list command if the stored cursor_stack value is invalid.
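
A minimal sketch of the graceful handling described above, assuming the stored value is a JSON-encoded string array. `parseCursorStack` and its `deleteRow` callback are hypothetical stand-ins for the real pagination code and its database cleanup.

```typescript
/**
 * Parses a stored cursor stack. On malformed JSON, removes the bad row
 * via the supplied callback and returns undefined instead of throwing.
 */
function parseCursorStack(
  raw: string,
  deleteRow: () => void
): string[] | undefined {
  try {
    return JSON.parse(raw) as string[];
  } catch {
    // Corrupted data: clean up the row so later calls start fresh.
    deleteRow();
    return undefined;
  }
}

let deleted = 0;
console.log(parseCursorStack('["a","b"]', () => deleted++)); // [ 'a', 'b' ]
console.log(parseCursorStack("{not json", () => deleted++)); // undefined
console.log(deleted); // 1
```

The cast does not validate the parsed shape; a stricter version could also check that the result is an array of strings before returning it.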


Comment on lines -496 to -503
  }
  if (!WIDGET_TYPES.includes(dataset as (typeof WIDGET_TYPES)[number])) {
    throw new ValidationError(
      `Invalid --dataset value "${dataset}".\nValid datasets: ${WIDGET_TYPES.join(", ")}`,
      "dataset"
    );
  }
}

Bug: The validation for deprecated dashboard widget datasets like "discover" was removed. Users will no longer be warned to migrate to supported alternatives like "error-events" or "spans".
Severity: MEDIUM

Suggested Fix

Reintroduce the validation logic that checks for and rejects deprecated dataset types. This can be done by checking against a list of deprecated datasets before the main WIDGET_TYPES inclusion check, and throwing a ValidationError with a helpful migration message for each deprecated type.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/commands/dashboard/resolve.ts#L496-L503

Potential issue: The refactored `validateWidgetEnums` function in
`src/commands/dashboard/resolve.ts` removed logic that explicitly rejected deprecated
dataset types. Previously, using datasets like `"discover"` or `"transaction-like"`
would trigger a helpful error message guiding the user to migrate to newer alternatives.
The new validation logic only checks if the dataset is included in the `WIDGET_TYPES`
array. Since the deprecated types are still present in this array, they now pass
validation silently. This is a regression that allows users to create dashboard widgets
with deprecated datasets without receiving any warning or migration guidance.
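
A minimal sketch of the suggested fix, checking deprecated datasets before the general inclusion check. The `WIDGET_TYPES` subset, the `DEPRECATED_DATASETS` list, the `validateDataset` name, the local `ValidationError` class, and the wording of the deprecation message are all illustrative assumptions; only the dataset names and the shape of the "Invalid --dataset value" error come from the thread.

```typescript
// Local stand-in for the repo's ValidationError.
class ValidationError extends Error {
  readonly field: string;
  constructor(message: string, field: string) {
    super(message);
    this.name = "ValidationError";
    this.field = field;
  }
}

// Illustrative subset; the real WIDGET_TYPES array is not shown in the thread.
const WIDGET_TYPES = ["discover", "transaction-like", "error-events", "spans"] as const;

// Assumed list, based on the two deprecated names mentioned in the comment.
const DEPRECATED_DATASETS: readonly string[] = ["discover", "transaction-like"];

function validateDataset(dataset: string): void {
  // Reject deprecated datasets first, since they still appear in WIDGET_TYPES.
  if (DEPRECATED_DATASETS.includes(dataset)) {
    throw new ValidationError(
      `The "${dataset}" dataset is deprecated. Use a supported alternative such as "error-events" or "spans".`,
      "dataset"
    );
  }
  if (!WIDGET_TYPES.includes(dataset as (typeof WIDGET_TYPES)[number])) {
    throw new ValidationError(
      `Invalid --dataset value "${dataset}".\nValid datasets: ${WIDGET_TYPES.join(", ")}`,
      "dataset"
    );
  }
}

validateDataset("spans"); // passes silently
```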
