Utf16 bom support #4326

Merged
zricethezav merged 3 commits into
trufflesecurity:mainfrom
joeleonjr:utf16-BOM
Aug 14, 2025

Conversation

@joeleonjr

Contributor

Description:

Secrets in UTF-16-encoded files are not always detected, because the extractSubstrings() function in the UTF-8 decoder changes the chunk data.

In engine.go, TruffleHog loops over each decoder, passing in the chunk's data for processing. The UTF-8 decoder runs first. If the chunk data is not valid UTF-8, the UTF-8 decoder calls extractSubstrings(). The result of that function is written back to the chunk's Data field, which is then passed to all subsequent decoders. Part of that function alters the byte structure of valid UTF-16 data, which makes some secrets impossible to detect.

Here's an example to test out:

echo <VALID_DETECTABLE_SECRET> > secret.txt
printf '\xFF\xFE' > utf16le.txt && iconv -f UTF-8 -t UTF-16LE secret.txt >> utf16le.txt
printf '\xFE\xFF' > utf16be.txt && iconv -f UTF-8 -t UTF-16BE secret.txt >> utf16be.txt
trufflehog filesystem utf*

Originally, I thought the problem was that we did not handle the UTF-16 byte order marks (BOMs), FE FF for big-endian and FF FE for little-endian. However, the existing logic in the utf16ToUTF8 function in utf16.go already takes care of those. I added two test cases to prove that.

The only change needed is creating a copy of the chunk prior to processing each decoder.

If that change is too expensive, I have 2 other ideas:

  1. Move extractSubstrings out of the UTF-8 decoder and invoke it directly in engine.go, before running FindDetectorMatches, when UTF-8 decoding fails.
  2. Store the result of extractSubstrings in a separate variable for later processing in FindDetectorMatches.

@zricethezav

Contributor

@joeleonjr lgtm

@joeleonjr joeleonjr marked this pull request as ready for review August 8, 2025 14:15
@joeleonjr joeleonjr requested a review from a team as a code owner August 8, 2025 14:16
@joeleonjr joeleonjr requested a review from a team August 8, 2025 14:16
@zricethezav zricethezav merged commit c319bb8 into trufflesecurity:main Aug 14, 2025
13 checks passed
peterfraedrich pushed a commit to peterfraedrich/trufflehog that referenced this pull request Mar 15, 2026
* added UTF-16 BOM support

* removed BOM removal; doesn't make a difference