Add HTML decoder for secret detection in HTML-formatted sources by alafiand · Pull Request #4840 · trufflesecurity/trufflehog

alafiand · 2026-03-25T21:00:13Z

Summary

Adds a new HTML decoder to the decoder pipeline that extracts clean text from HTML content, enabling secret detection in sources like MS Teams and Confluence that emit HTML rather than plain text.
Parses HTML to extract text nodes, high-signal attribute values (href, src, value, xlink:href, etc.), script/style/comment content, and code/pre blocks with proper token boundary preservation.
Handles syntax-highlight boundary detection (hljs-* classes), zero-width/invisible character stripping, URL decoding in attributes, and double-encoded HTML entity cleanup.
Adds HTML = 5 to the DecoderType protobuf enum and registers the decoder in DefaultDecoders().
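
Two of the cleanup steps listed above, zero-width character stripping and double-encoded entity cleanup, can be sketched with the standard library alone. This is an illustrative sketch, not the code in pkg/decoders/html.go: the helper names `stripInvisible` and `unescapeEntities`, the set of code points removed, and the two-pass limit are all assumptions.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripInvisible removes a few zero-width code points that can split a
// token without being visible. The exact set is an assumption for this sketch.
func stripInvisible(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\u200B', '\u200C', '\u200D', '\u2060', '\uFEFF':
			return -1 // a negative rune tells strings.Map to drop the character
		}
		return r
	}, s)
}

// unescapeEntities decodes HTML entities at most twice, so that a
// double-encoded "&amp;quot;" ends up as a literal quote character.
func unescapeEntities(s string) string {
	for i := 0; i < 2; i++ {
		next := html.UnescapeString(s)
		if next == s {
			break // nothing left to decode
		}
		s = next
	}
	return s
}

func main() {
	fmt.Printf("%q\n", stripInvisible("sk_\u200Blive_\uFEFFabc"))
	fmt.Printf("%q\n", unescapeEntities("token=&amp;quot;abc&amp;quot;"))
}
```

Capping the entity pass at two iterations keeps the sketch from repeatedly decoding text that legitimately contains an ampersand sequence; whether the real decoder bounds this the same way is not stated in the PR.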

Test plan

Unit tests covering: split secrets across tags, attribute extraction, URL decoding, script/style/comment content, code blocks, syntax-highlight boundaries, zero-width character stripping, double-encoded entities, feature flag gating
Integration tested via thog dev deploy with live MS Teams scanning (companion thog PR)

Made with Cursor

Note

Medium Risk
Adds a new decoder and protobuf enum value that changes the decoding pipeline output for HTML-like chunks, which could affect detection results and performance on large inputs. Risk is mitigated by heuristic gating and extensive unit tests, but it still touches core scanning behavior.

Overview
Adds a new HTML decoder to the default decoding pipeline so HTML-formatted inputs (e.g., Teams/Confluence) are normalized into detector-friendly text.

The decoder parses HTML to emit visible text plus high-signal attribute values (including data-* and xlink:href), preserves script/style/comment content, inserts newline boundaries for block elements and hljs-* syntax-highlight spans, and cleans up URL-encoding, residual/double-encoded entities, and zero-width/invisible characters.
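
The boundary behavior described above (inline tags joined, block elements and hljs-* spans separated by newlines) can be illustrated with a small sketch. It uses the standard library's encoding/xml decoder in non-strict mode as a stand-in tokenizer, which is an assumption made to keep the example dependency-free; the PR does not say which parser the decoder uses. The names `extractText`, `blockTags`, and `isBoundary`, and the particular list of block tags, are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// blockTags is an assumed, deliberately small set of block-level elements.
var blockTags = map[string]bool{"p": true, "div": true, "br": true, "li": true, "pre": true}

// isBoundary reports whether a tag should introduce a newline: either it is
// a block element or it carries an hljs-* class (a syntax-highlight span).
func isBoundary(name string, attrs []xml.Attr) bool {
	if blockTags[strings.ToLower(name)] {
		return true
	}
	for _, a := range attrs {
		if a.Name.Local != "class" {
			continue
		}
		for _, c := range strings.Fields(a.Value) {
			if strings.HasPrefix(c, "hljs-") {
				return true
			}
		}
	}
	return false
}

// extractText concatenates text nodes, so a secret split across inline tags
// is rejoined, and writes a newline at block and hljs-* boundaries.
func extractText(src string) string {
	dec := xml.NewDecoder(strings.NewReader(src))
	dec.Strict = false // tolerate HTML that is not well-formed XML
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var out strings.Builder
	for {
		tok, err := dec.Token()
		if err != nil { // io.EOF or a parse error ends the walk
			break
		}
		switch t := tok.(type) {
		case xml.CharData:
			out.Write(t)
		case xml.StartElement:
			if isBoundary(t.Name.Local, t.Attr) {
				out.WriteByte('\n')
			}
		case xml.EndElement:
			if isBoundary(t.Name.Local, nil) {
				out.WriteByte('\n')
			}
		}
	}
	return strings.TrimSpace(out.String())
}

func main() {
	fmt.Printf("%q\n", extractText(`<p>sk_live_<b>abc</b>123</p>`))
	fmt.Printf("%q\n", extractText(`<code><span class="hljs-attr">key</span><span class="hljs-string">value</span></code>`))
}
```

In this sketch the inline `<b>` tag contributes nothing to the output, so the split token is rejoined, while the adjacent hljs-* spans are kept on separate lines so that neighboring tokens are not fused into one false match.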

Updates the DecoderType protobuf enum to include HTML = 5 and adds comprehensive unit tests covering extraction behavior, whitespace/token boundaries, and feature-flag gating.

Written by Cursor Bugbot for commit 2e42a11. This will update automatically on new commits.

Sources like MS Teams and Confluence emit HTML rather than plain text, causing secrets split across tags or embedded in attributes to be missed. This adds an HTML decoder to the pipeline that extracts text nodes, high-signal attribute values, script/style/comment content, and code blocks. It handles syntax-highlight boundary detection, zero-width character stripping, and double-encoded HTML entity decoding. Made-with: Cursor

CLAassistant · 2026-03-25T21:00:20Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

cursor · 2026-03-25T21:17:32Z

pkg/decoders/html.go

+		decoded, err := url.QueryUnescape(val)
+		if err == nil && decoded != val {
+			val = decoded
+		}


QueryUnescape corrupts secrets containing + characters

Medium Severity

url.QueryUnescape converts + to a space character, which can corrupt secrets in attribute values. This affects all high-signal attributes (value, content, alt, title, href, src, data-*), not just URL-containing ones. For example, an <input value="sk_test_EXAMPLE+KEY"/> would produce sk_test_EXAMPLE KEY, breaking detector regex matching. Using url.PathUnescape instead would decode %XX sequences while preserving + as a literal character, which is safer for secret detection.
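
The difference described in this comment can be checked with a short sketch. Both `url.QueryUnescape` and `url.PathUnescape` are standard library functions; the helper name `unescapeAttr` and the sample value are hypothetical, and the fallback-on-error behavior mirrors the reviewed snippet rather than any final implementation.

```go
package main

import (
	"fmt"
	"net/url"
)

// unescapeAttr decodes %XX sequences in an attribute value while leaving
// '+' untouched. On a malformed escape it returns the input unchanged.
func unescapeAttr(val string) string {
	decoded, err := url.PathUnescape(val)
	if err != nil {
		return val
	}
	return decoded
}

func main() {
	const val = "sk_test_EXAMPLE+KEY%3D"

	q, _ := url.QueryUnescape(val)
	fmt.Printf("QueryUnescape: %q\n", q)                 // "sk_test_EXAMPLE KEY="
	fmt.Printf("PathUnescape:  %q\n", unescapeAttr(val)) // "sk_test_EXAMPLE+KEY="
}
```

`QueryUnescape` applies form-encoding rules, where `+` stands for a space, so it is only appropriate for query-string components; `PathUnescape` decodes percent escapes without that substitution.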

Merge branch 'main' into dl.276-new-html-decoder

2e42a11

cursor bot reviewed Mar 25, 2026

alafiand wants to merge 2 commits into main from dl.276-new-html-decoder


2 participants
