
Add HTML decoder for secret detection in HTML-formatted sources #4840

Draft
alafiand wants to merge 2 commits into main from dl.276-new-html-decoder


@alafiand alafiand commented Mar 25, 2026

Summary

  • Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
  • Parses HTML to extract text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
  • Handles syntax-highlight boundary detection (hljs-* classes), zero-width/invisible character stripping, URL decoding in attributes, and double-encoded HTML entity cleanup.
  • Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
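Two of the cleanup steps listed above, zero-width character stripping and double-encoded entity cleanup, can be sketched with the standard library alone. This is an illustrative sketch, not the code in this PR: the function names `stripInvisible` and `unescapeEntities`, the set of code points, and the pass limit are all assumptions.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripInvisible drops a few common zero-width code points.
// The set is illustrative; the decoder in the PR may cover a different set.
func stripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\u200B', '\u200C', '\u200D', '\u2060', '\uFEFF':
			return -1 // a negative return value removes the rune
		}
		return r
	}, s)
}

// unescapeEntities applies html.UnescapeString until the text stops
// changing or maxPasses is reached, so "&amp;lt;" ends up as "<".
// maxPasses is a hypothetical guard against unbounded looping.
func unescapeEntities(s string, maxPasses int) string {
	for i := 0; i < maxPasses; i++ {
		decoded := html.UnescapeString(s)
		if decoded == s {
			break
		}
		s = decoded
	}
	return s
}

func main() {
	fmt.Println(stripInvisible("AK\u200bIA\ufeff123"))
	fmt.Println(unescapeEntities("key=abc&amp;lt;def", 2))
}
```

Capping the number of passes keeps the work bounded on adversarial input while still handling the common double-encoded case.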

Test plan

  • Unit tests covering: split secrets across tags, attribute extraction, URL decoding, script/style/comment content, code blocks, syntax-highlight boundaries, zero-width character stripping, double-encoded entities, feature flag gating
  • Integration tested via thog dev deploy with live MS Teams scanning (companion thog PR)

Made with Cursor


Note

Medium Risk
Adds a new decoder and protobuf enum value that changes the decoding pipeline output for HTML-like chunks, which could affect detection results and performance on large inputs. Risk is mitigated by heuristic gating and extensive unit tests, but it still touches core scanning behavior.

Overview
Adds a new HTML decoder to the default decoding pipeline so HTML-formatted inputs (e.g., Teams/Confluence) are normalized into detector-friendly text.

The decoder parses HTML to emit visible text plus high-signal attribute values (including data-* and xlink:href), preserves script/style/comment content, inserts newline boundaries for block elements and hljs-* syntax-highlight spans, and cleans up URL-encoding, residual/double-encoded entities, and zero-width/invisible characters.

Updates the DecoderType protobuf enum to include HTML = 5 and adds comprehensive unit tests covering extraction behavior, whitespace/token boundaries, and feature-flag gating.

Written by Cursor Bugbot for commit 2e42a11. This will update automatically on new commits.

Sources like MS Teams and Confluence emit HTML rather than plain text,
causing secrets split across tags or embedded in attributes to be missed.
This adds an HTML decoder to the pipeline that extracts text nodes,
high-signal attribute values, script/style/comment content, and code blocks.
It handles syntax-highlight boundary detection, zero-width character stripping,
and double-encoded HTML entity decoding.

Made-with: Cursor
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


decoded, err := url.QueryUnescape(val)
if err == nil && decoded != val {
	val = decoded
}

QueryUnescape corrupts secrets containing + characters

Medium Severity

url.QueryUnescape converts + to a space character, which can corrupt secrets in attribute values. This affects all high-signal attributes (value, content, alt, title, href, src, data-*), not just URL-containing ones. For example, an <input value="sk_test_EXAMPLE+KEY"/> would produce sk_test_EXAMPLE KEY, breaking detector regex matching. Using url.PathUnescape instead would decode %XX sequences while preserving + as a literal character, which is safer for secret detection.
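The difference can be reproduced in isolation. The sketch below is not the PR's code: `decodeAttrQuery` and `decodeAttrPath` are hypothetical helper names, and the second sample value is made up for illustration. Both helpers fall back to the raw value when the input contains an invalid escape.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeAttrQuery mirrors the snippet under review: url.QueryUnescape
// decodes %XX sequences and also turns '+' into a space.
func decodeAttrQuery(val string) string {
	if decoded, err := url.QueryUnescape(val); err == nil {
		return decoded
	}
	return val
}

// decodeAttrPath is the suggested alternative: url.PathUnescape decodes
// %XX sequences but leaves '+' as a literal character.
func decodeAttrPath(val string) string {
	if decoded, err := url.PathUnescape(val); err == nil {
		return decoded
	}
	return val
}

func main() {
	for _, val := range []string{"sk_test_EXAMPLE+KEY", "token%3Dabc+def"} {
		fmt.Printf("%q query=%q path=%q\n", val, decodeAttrQuery(val), decodeAttrPath(val))
	}
}
```

One trade-off to keep in mind: a value that really is form-encoded, where `+` stands for a space, will keep its `+` under `PathUnescape`.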

