feat(skill): add eval framework to measure SKILL.md effectiveness by BYK · Pull Request #602 · getsentry/cli

BYK · 2026-03-30T11:12:39Z

Summary

Adds an evaluation framework that measures how effectively SKILL.md guides an LLM agent to use the Sentry CLI efficiently. Inspired by the skill-creator plugin approach of prompt → plan → grade.

- Two-phase eval: sends SKILL.md + user prompt to an LLM, then grades the planned commands with deterministic checks (string matching) and an LLM judge (coherence)
- 8 test cases covering the failure modes from #598 (Improve skill: Avoid auth on every request, and improve knowledge about auto project detection): no pre-auth, no org/project lookup, correct fields, minimal calls, trusts auto-detection
- Anthropic API with claude-sonnet-4-6 + claude-opus-4-6 as agents, claude-haiku-4-5 as judge
- CI job runs on skill-related file changes, protected by the skill-eval environment (requires reviewer approval to use the API key)
- Blocking: added to CI Status, fails below 75% threshold
- Baseline: 8/8 cases passed (100%) on both models
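
To make the grading step concrete, here is a minimal sketch of the deterministic phase and the 75% gate described above. The `EvalCase` shape, the function names, and the command strings are hypothetical illustrations, not the PR's actual types or verified Sentry CLI syntax, and the LLM judge phase is left out.

```typescript
// Hypothetical case shape; the real schema in test/skill-eval/cases.json may differ.
type EvalCase = {
  id: string;
  expectedPatterns: string[]; // each must appear in at least one planned command
  antiPatterns: string[]; // none may appear in any planned command
  maxCommands: number; // upper bound standing in for the "minimal calls" criterion
};

// Deterministic phase: plain string matching on the commands the agent planned.
function gradeCase(testCase: EvalCase, plannedCommands: string[]): boolean {
  const hasAllExpected = testCase.expectedPatterns.every((pattern) =>
    plannedCommands.some((cmd) => cmd.includes(pattern))
  );
  const hasAnyAnti = testCase.antiPatterns.some((pattern) =>
    plannedCommands.some((cmd) => cmd.includes(pattern))
  );
  return (
    hasAllExpected &&
    !hasAnyAnti &&
    plannedCommands.length <= testCase.maxCommands
  );
}

// Gate: the run fails when the pass rate drops below the threshold.
function meetsThreshold(passed: number, total: number, threshold = 0.75): boolean {
  return total > 0 && passed / total >= threshold;
}

// Illustrative case: the agent should list issues without checking auth first.
const authCase: EvalCase = {
  id: "no-pre-auth",
  expectedPatterns: ["sentry issue list"],
  antiPatterns: ["sentry auth"],
  maxCommands: 1,
};

console.log(gradeCase(authCase, ["sentry issue list"])); // true
console.log(gradeCase(authCase, ["sentry auth status", "sentry issue list"])); // false
console.log(meetsThreshold(6, 8)); // true, since 6/8 is exactly 0.75
```

With 8 cases, this gate tolerates at most two failures per model before the job goes red.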

Running locally

With an Anthropic API key:

ANTHROPIC_API_KEY=sk-ant-... bun run eval:skill

Test a single model:

EVAL_AGENT_MODELS=claude-sonnet-4-6 ANTHROPIC_API_KEY=... bun run eval:skill

Ref #598

github-actions · 2026-03-30T11:12:54Z

Semver Impact of This PR

🟡 Minor (new features)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).

New Features ✨

(skill) Add eval framework to measure SKILL.md effectiveness by BYK in #602

(telemetry) Add seer.outcome span tag for Seer command metrics by BYK in #609
(upgrade) Show changelog summary during CLI upgrade by BYK in #594

Bug Fixes 🐛

Upgrade

Prevent spinner freeze during delta patch application by BYK in #608
Indent changelog, add emoji to heading, hide empty sections by BYK in #604

Other

(dashboard) Reject MRI queries with actionable tracemetrics guidance by BYK in #601
(skill) Avoid unnecessary auth, reinforce auto-detection, fix field examples by BYK in #599
2 bug fixes — subcommand crash, negative span depth, pagination JSON parse by cursor in #607

Documentation 📚

(skill) Document dashboard widget constraints and deprecated datasets by BYK in #605
Fix documentation gaps and embed skill files at build time by cursor in #606

Internal Changes 🔧

Regenerate skill files and command docs by github-actions[bot] in 664362ca

🤖 This preview updates automatically when you update the PR.

github-actions · 2026-03-30T11:14:29Z

Codecov Results 📊

✅ 129 passed | Total: 129 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

Metric	Change
Total Tests	—
Passed Tests	—
Failed Tests	—
Skipped Tests	—

✨ No test changes detected

All tests are passing successfully.

❌ Patch coverage is 66.67%. Project has 1303 uncovered lines.
❌ Project coverage is 95.62%. Comparing base (base) to head (head).

Files with missing lines (1)

File	Patch %	Lines
src/lib/formatters/human.ts	6.25%	⚠️ 15 Missing

Coverage diff

@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
- Coverage    95.73%    95.62%    -0.11%
==========================================
  Files          204       204         —
  Lines        29877     29739      -138
  Branches         0         0         —
==========================================
+ Hits         28601     28436      -165
- Misses        1276      1303       +27
- Partials         0         0         —

Generated by Codecov Action

test/skill-eval/helpers/judge.ts

BYK · 2026-03-30T11:19:35Z

Addressed Cursor Bugbot feedback: expected-patterns check in judge.ts now uses allCommands.some(cmd => cmd.includes(...)) per-command instead of matching against the concatenated joined string. This prevents false positives from patterns matching across command boundaries, consistent with how anti-patterns already works.
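
The difference between the two matching strategies can be sketched as follows. This is an illustration only: the helper names are hypothetical, the space separator used for joining is an assumption, and the command strings are contrived to trigger the cross-boundary match.

```typescript
// Old behavior (as described): match against one concatenated string.
// The " " separator is an assumption made for this sketch.
function matchesJoined(commands: string[], pattern: string): boolean {
  return commands.join(" ").includes(pattern);
}

// New behavior: a pattern must appear inside a single command.
function matchesPerCommand(commands: string[], pattern: string): boolean {
  return commands.some((cmd) => cmd.includes(pattern));
}

// Contrived plan: neither command contains "org list" on its own,
// but the end of the first plus the start of the second does.
const allCommands = ["sentry org", "list projects"];
const pattern = "org list";

console.log(matchesJoined(allCommands, pattern)); // true (false positive)
console.log(matchesPerCommand(allCommands, pattern)); // false
```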

test/skill-eval/helpers/planner.ts

betegon

A couple of comments, and maybe we should consider using something like https://github.com/getsentry/vitest-evals (although we're on bun test).

test/skill-eval/cases.json

test/skill-eval/helpers/judge.ts

test/skill-eval/helpers/planner.ts

.github/workflows/ci.yml

test/skill-eval/helpers/planner.ts

sentry · 2026-03-31T10:34:43Z

.github/workflows/eval-skill-fork.yml

+
+      - uses: actions/cache@v5
+        id: cache
+        with:
+          path: node_modules
+          key: node-modules-${{ hashFiles('bun.lock', 'patches/**') }}
+      - if: steps.cache.outputs.cache-hit != 'true'


Bug: The eval-skill-fork.yml workflow uses pull_request_target and checks out untrusted fork code, which is then executed with access to repository secrets, allowing for potential secret exfiltration.
Severity: CRITICAL

Suggested Fix

To prevent untrusted code execution with access to secrets, change the workflow trigger from pull_request_target to pull_request. This ensures the workflow runs in the context of the fork without access to secrets. If secrets are necessary, refactor the workflow to run only trusted code from the base repository, and avoid checking out the pull request's head commit (github.event.pull_request.head.sha).

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: .github/workflows/eval-skill-fork.yml#L40-L46

Potential issue: The GitHub Actions workflow `eval-skill-fork.yml` is triggered by `pull_request_target`, which grants it access to repository secrets. The workflow then checks out untrusted code from a fork pull request using `actions/checkout` with the PR's `head.sha`. This untrusted code is subsequently executed in a step that has access to secrets like `ANTHROPIC_API_KEY` and `SENTRY_RELEASE_BOT_PRIVATE_KEY`. A malicious actor could modify the code in their fork PR to exfiltrate these secrets, as the executed scripts have direct access to environment variables.

.github/workflows/ci.yml

Two-phase eval: sends test prompts to an LLM with SKILL.md as context, then grades the planned commands on efficiency criteria (no pre-auth, no org lookup, correct fields, minimal calls, trusts auto-detection).

- 8 test cases covering the failure modes from issue #598
- Deterministic checks (string matching) + LLM judge (coherence)
- Uses Anthropic API (claude-sonnet-4-6, claude-opus-4-6) via repo secret
- CI job runs on skill-related file changes, fails below 75% threshold
- Fork PRs: blocked until maintainer adds eval-skill label, eval runs via pull_request_target, results posted as commit status
- Label removed on synchronize (new push forces re-review)
- Uses SENTRY_RELEASE_BOT app token to re-trigger main CI after fork eval

cursor · 2026-03-31T10:52:04Z

src/lib/arg-parsing.ts

  }
  const n = Number(input);
-  if (Number.isNaN(n) || n < 0) {
+  if (Number.isNaN(n)) {


Negative span depth now accepted instead of defaulting

Medium Severity

The n < 0 guard was removed from parseSpanDepth, so negative inputs like "-1" now return the negative number instead of falling back to DEFAULT_SPAN_DEPTH (3). The docstring explicitly states "Invalid values fall back to default depth (3)," and previously negative values were treated as invalid. With the guard removed, a negative depth fails the spans > 0 check, which effectively disables span trees: a silent behavior change from showing 3 levels of depth to showing none.
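
A minimal sketch of the guarded behavior this comment asks to keep, assuming only the numeric path of the function (the real parseSpanDepth has earlier branches that are not visible in the diff excerpt):

```typescript
const DEFAULT_SPAN_DEPTH = 3;

// Sketch of the numeric path only; earlier branches of the real function are omitted.
function parseSpanDepth(input: string): number {
  const n = Number(input);
  // Treat both non-numeric and negative input as invalid and use the default.
  if (Number.isNaN(n) || n < 0) {
    return DEFAULT_SPAN_DEPTH;
  }
  return n;
}

console.log(parseSpanDepth("5")); // 5
console.log(parseSpanDepth("-1")); // 3
console.log(parseSpanDepth("abc")); // 3
```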

cursor · 2026-03-31T10:52:04Z

src/lib/formatters/human.ts

  for (const section of changelog.sections) {
-    // Skip sections whose markdown is empty after whitespace trimming
-    if (!section.markdown.trim()) {
-      continue;


Changelog whitespace section filtering was removed

Low Severity

The old formatChangelog skipped sections whose markdown was whitespace-only (via .trim() check) and returned empty string if all sections were empty after filtering. The new code unconditionally pushes every section's markdown, including whitespace-only ones. This produces changelog output with category headings followed by blank content, and no longer returns empty string when all sections have only whitespace.
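
The filtering described above can be sketched as follows. The `Changelog` and `ChangelogSection` types and the `heading` field are assumptions for illustration; only `changelog.sections` and `section.markdown` appear in the diff excerpt.

```typescript
// Hypothetical shapes; only `sections` and `markdown` are confirmed by the diff.
type ChangelogSection = { heading: string; markdown: string };
type Changelog = { sections: ChangelogSection[] };

function formatChangelog(changelog: Changelog): string {
  const parts: string[] = [];
  for (const section of changelog.sections) {
    // Skip sections whose markdown is empty after whitespace trimming
    if (!section.markdown.trim()) {
      continue;
    }
    parts.push(`${section.heading}\n${section.markdown}`);
  }
  // With every section filtered out, this joins an empty array and yields "".
  return parts.join("\n\n");
}

console.log(
  formatChangelog({
    sections: [
      { heading: "New Features", markdown: "- Add eval framework" },
      { heading: "Bug Fixes", markdown: "   \n" },
    ],
  })
);
```

Only the "New Features" heading and its entry are printed; the whitespace-only "Bug Fixes" section is dropped.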

github-actions · 2026-03-31T10:52:33Z

PR Preview Action v1.8.1
🚀 View preview at https://cli.sentry.dev/pr-preview/pr-602/
Built to branch `gh-pages` at 2026-03-31 10:52 UTC. Preview will be ready when the GitHub Pages deployment is complete.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

cursor · 2026-03-31T10:54:54Z

src/lib/db/pagination.ts

  }

-  let stack: string[];
-  try {


Removed try-catch around JSON.parse for DB data

Low Severity

The try-catch around JSON.parse(row.cursor_stack) was removed. Previously, corrupted or malformed JSON in the database would be gracefully handled by deleting the bad row and returning undefined. Now it throws an unhandled exception, which could crash pagination for any list command if the stored cursor_stack value is invalid.
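
The graceful handling described above might look roughly like this. The `parseCursorStack` helper and its `deleteRow` callback are hypothetical stand-ins for the real database code, and the extra shape check on the parsed value is an addition not mentioned in the comment.

```typescript
// Hypothetical helper: `deleteRow` stands in for whatever removes the bad DB row.
function parseCursorStack(
  raw: string,
  deleteRow: () => void
): string[] | undefined {
  try {
    const parsed: unknown = JSON.parse(raw);
    // Extra safety (an assumption): accept only an array of strings.
    if (Array.isArray(parsed) && parsed.every((item) => typeof item === "string")) {
      return parsed as string[];
    }
  } catch {
    // Malformed JSON falls through to the cleanup below.
  }
  deleteRow();
  return undefined;
}

let deletedRows = 0;
console.log(parseCursorStack('["abc","def"]', () => deletedRows++)); // [ 'abc', 'def' ]
console.log(parseCursorStack("{not json", () => deletedRows++)); // undefined
console.log(deletedRows); // 1
```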

sentry · 2026-03-31T10:55:39Z

src/commands/dashboard/resolve.ts

-  }
-  if (!WIDGET_TYPES.includes(dataset as (typeof WIDGET_TYPES)[number])) {
-    throw new ValidationError(
-      `Invalid --dataset value "${dataset}".\nValid datasets: ${WIDGET_TYPES.join(", ")}`,
-      "dataset"
-    );
-  }
-}


Bug: The validation for deprecated dashboard widget datasets like "discover" was removed. Users will no longer be warned to migrate to supported alternatives like "error-events" or "spans".
Severity: MEDIUM

Suggested Fix

Reintroduce the validation logic that checks for and rejects deprecated dataset types. This can be done by checking against a list of deprecated datasets before the main WIDGET_TYPES inclusion check, and throwing a ValidationError with a helpful migration message for each deprecated type.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: src/commands/dashboard/resolve.ts#L496-L503

Potential issue: The refactored `validateWidgetEnums` function in `src/commands/dashboard/resolve.ts` removed logic that explicitly rejected deprecated dataset types. Previously, using datasets like `"discover"` or `"transaction-like"` would trigger a helpful error message guiding the user to migrate to newer alternatives. The new validation logic only checks if the dataset is included in the `WIDGET_TYPES` array. Since the deprecated types are still present in this array, they now pass validation silently. This is a regression that allows users to create dashboard widgets with deprecated datasets without receiving any warning or migration guidance.
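
One way the suggested fix could look, as a sketch under stated assumptions: the `WIDGET_TYPES` contents, the `ValidationError` class, the `validateDataset` name, and especially the mapping from each deprecated dataset to a replacement are all hypothetical. The review only says that "discover" and "transaction-like" are deprecated and that "error-events" and "spans" are supported alternatives.

```typescript
// Stand-in for the project's error class (assumed shape).
class ValidationError extends Error {
  field?: string;
  constructor(message: string, field?: string) {
    super(message);
    this.name = "ValidationError";
    this.field = field;
  }
}

// Assumed subset of the real list; deprecated types are still members.
const WIDGET_TYPES = ["discover", "transaction-like", "error-events", "spans"] as const;

// ASSUMPTION: which replacement goes with which deprecated dataset is a guess.
const DEPRECATED_DATASETS = new Map<string, string>([
  ["discover", "error-events"],
  ["transaction-like", "spans"],
]);

function validateDataset(dataset: string): void {
  // Check deprecated datasets first so they get migration guidance.
  const replacement = DEPRECATED_DATASETS.get(dataset);
  if (replacement !== undefined) {
    throw new ValidationError(
      `The "${dataset}" dataset is deprecated. Use "${replacement}" instead.`,
      "dataset"
    );
  }
  if (!WIDGET_TYPES.includes(dataset as (typeof WIDGET_TYPES)[number])) {
    throw new ValidationError(
      `Invalid --dataset value "${dataset}".\nValid datasets: ${WIDGET_TYPES.join(", ")}`,
      "dataset"
    );
  }
}

validateDataset("spans"); // passes silently
try {
  validateDataset("discover");
} catch (err) {
  console.log((err as Error).message);
}
```

Checking the deprecated list before the membership check matters here because, per the review, the deprecated names still appear in WIDGET_TYPES and would otherwise pass.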

BYK marked this pull request as ready for review March 30, 2026 11:13

BYK requested review from MathurAditya724 and betegon March 30, 2026 11:14

cursor bot reviewed Mar 30, 2026

test/skill-eval/helpers/judge.ts (resolved)

sentry bot reviewed Mar 30, 2026

test/skill-eval/helpers/planner.ts (resolved)

betegon reviewed Mar 30, 2026

test/skill-eval/cases.json (resolved)

test/skill-eval/cases.json (resolved)

test/skill-eval/cases.json (resolved)

sentry bot reviewed Mar 30, 2026

test/skill-eval/helpers/judge.ts (resolved)

cursor bot reviewed Mar 30, 2026

test/skill-eval/helpers/planner.ts (outdated, resolved)

BYK temporarily deployed to skill-eval March 30, 2026 18:28 with GitHub Actions (inactive)

cursor bot reviewed Mar 30, 2026

.github/workflows/ci.yml (outdated, resolved)

test/skill-eval/helpers/planner.ts (resolved)

BYK temporarily deployed to skill-eval March 30, 2026 18:42 with GitHub Actions (inactive)

BYK temporarily deployed to skill-eval March 30, 2026 21:18 with GitHub Actions (inactive)

sentry bot reviewed Mar 31, 2026

cursor bot reviewed Mar 31, 2026

.github/workflows/ci.yml (outdated, resolved)

BYK closed this Mar 31, 2026

BYK reopened this Mar 31, 2026

BYK force-pushed the feat/skill-eval-framework branch from d409adf to 353155f, March 31, 2026 10:50

cursor bot reviewed Mar 31, 2026

BYK force-pushed the feat/skill-eval-framework branch from 353155f to 6644186, March 31, 2026 10:52

cursor bot reviewed Mar 31, 2026

sentry bot reviewed Mar 31, 2026
