Add HTML decoder for secret detection in HTML-formatted sources #4840
Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding.
Made-with: Cursor
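The zero-width stripping and double-encoded entity decoding mentioned in the description can be illustrated with a minimal sketch. This is not the PR's implementation: `cleanText`, the `invisible` replacer, the chosen set of code points, and the bound of three unescape passes are all assumptions made for illustration, using only the standard library.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// invisible removes a few common zero-width code points.
// The exact set is an assumption; the real decoder may strip more or fewer.
var invisible = strings.NewReplacer(
	"\u200b", "", // zero width space
	"\u200c", "", // zero width non-joiner
	"\u200d", "", // zero width joiner
	"\u2060", "", // word joiner
	"\ufeff", "", // zero width no-break space (BOM)
)

// cleanText is a hypothetical helper: it unescapes HTML entities until the
// string stops changing (at most three passes, an arbitrary bound), then
// drops invisible characters that would split a token.
func cleanText(s string) string {
	for i := 0; i < 3; i++ {
		next := html.UnescapeString(s)
		if next == s {
			break
		}
		s = next
	}
	return invisible.Replace(s)
}

func main() {
	fmt.Println(cleanText("pass&amp;amp;word")) // double-encoded ampersand
	fmt.Println(cleanText("AKIA\u200bEXAMPLE")) // zero width space inside a token
}
```

Bounding the number of passes is a design choice in this sketch: it limits how far a value that legitimately contains entity-like text can be rewritten.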
Cursor Bugbot has reviewed your changes and found 1 potential issue.
	decoded, err := url.QueryUnescape(val)
	if err == nil && decoded != val {
		val = decoded
	}
QueryUnescape corrupts secrets containing + characters
Medium Severity
url.QueryUnescape converts + to a space character, which can corrupt secrets in attribute values. This affects all high-signal attributes (value, content, alt, title, href, src, data-*), not just URL-containing ones. For example, an <input value="sk_test_EXAMPLE+KEY"/> would produce sk_test_EXAMPLE KEY, breaking detector regex matching. Using url.PathUnescape instead would decode %XX sequences while preserving + as a literal character, which is safer for secret detection.


Summary
- Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
- Extracts text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
- Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
Test plan
Note
Medium Risk
Adds a new decoder and protobuf enum value that changes the decoding pipeline output for HTML-like chunks, which could affect detection results and performance on large inputs. Risk is mitigated by heuristic gating and extensive unit tests, but it still touches core scanning behavior.
Overview
Adds a new HTML decoder to the default decoding pipeline so HTML-formatted inputs (e.g., Teams/Confluence) are normalized into detector-friendly text.
The decoder parses HTML to emit visible text plus high-signal attribute values (including data-* and xlink:href), preserves script/style/comment content, inserts newline boundaries for block elements and hljs-* syntax-highlight spans, and cleans up URL-encoding, residual/double-encoded entities, and zero-width/invisible characters.
Updates the DecoderType protobuf enum to include HTML = 5 and adds comprehensive unit tests covering extraction behavior, whitespace/token boundaries, and feature-flag gating.
Written by Cursor Bugbot for commit 2e42a11.