Rewrite encoding detection as inverted UTF-8 validator #7213

Merged
pstreef merged 4 commits into main from pstreef/fix-charset-detect
Apr 1, 2026
Contributor

@pstreef pstreef commented Mar 31, 2026

Problem

`EncodingDetectingInputStream.guessCharset()` enumerated known-bad UTF-8 byte patterns, so any invalid pattern not explicitly listed slipped through undetected. As a result, ISO-8859-1 files (e.g. containing German umlauts such as ü = 0xFC) were misdetected as UTF-8, which led to `mod git apply` failures.

Specific gaps: bytes 0xF8-0xFF fell through entirely, and 0xC0/0xC1 (overlong encodings forbidden by RFC 3629) were accepted as valid two-byte sequence leads.
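To make the problem concrete, here is a minimal sketch that uses the JDK's own strict UTF-8 decoder as a reference check. The class and method names (`StrictUtf8Check`, `isValidUtf8`) are hypothetical and are not part of OpenRewrite; the sketch only illustrates that a lone 0xFC byte and a 0xC0 lead byte are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the detector from this PR.
final class StrictUtf8Check {
    /** Returns true only if the JDK's strict decoder accepts every byte as UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this helper, `"\u00FC".getBytes(StandardCharsets.ISO_8859_1)` yields the single byte 0xFC, which the check rejects, while the UTF-8 encoding of the same character (0xC3 0xBC) is accepted.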

Solution

Invert the detection logic: instead of listing invalid bytes, the new implementation tracks a `remainingContinuationBytes` counter and only accepts bytes that are explicitly valid in their position. Anything else triggers the Windows-1252 fallback.

This is simpler (15 lines vs. 30; one int field vs. three booleans plus two previous-byte trackers) and eliminates detection gaps by default: unknown bytes are rejected rather than silently accepted.

Also tightens the 4-byte lead range from 0xF0-0xF7 to 0xF0-0xF4 (0xF5-0xF7 would encode above U+10FFFF).
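The approach described above can be sketched roughly as follows. This is an illustrative reconstruction from the description, not the merged code: the class name `InvertedUtf8Sketch`, the `accept` and `guess` methods, and the handling of a sequence truncated at end of input are all assumptions. Only the `remainingContinuationBytes` counter, the lead-byte ranges, and the Windows-1252 fallback come from the PR text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an "inverted" UTF-8 check; not the merged OpenRewrite code.
final class InvertedUtf8Sketch {
    // Assumption: the JDK in use provides the windows-1252 charset.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    private int remainingContinuationBytes = 0;
    private Charset charset = StandardCharsets.UTF_8;

    /** Feed one byte; anything not explicitly valid in its position triggers the fallback. */
    void accept(int b) {
        if (charset != StandardCharsets.UTF_8) {
            return; // fallback already chosen, nothing more to decide
        }
        b &= 0xFF; // treat the input as an unsigned byte
        if (remainingContinuationBytes > 0) {
            if (b >= 0x80 && b <= 0xBF) {
                remainingContinuationBytes--;   // expected continuation byte
            } else {
                charset = WINDOWS_1252;         // sequence was interrupted
            }
        } else if (b <= 0x7F) {
            // ASCII: valid on its own, nothing to track
        } else if (b >= 0xC2 && b <= 0xDF) {
            remainingContinuationBytes = 1;     // two-byte lead (0xC0/0xC1 excluded)
        } else if (b >= 0xE0 && b <= 0xEF) {
            remainingContinuationBytes = 2;     // three-byte lead
        } else if (b >= 0xF0 && b <= 0xF4) {
            remainingContinuationBytes = 3;     // four-byte lead (0xF5-0xF7 excluded)
        } else {
            charset = WINDOWS_1252;             // stray continuation, 0xC0/0xC1, 0xF5-0xFF
        }
    }

    /** Convenience wrapper that runs the check over a whole byte array. */
    static Charset guess(byte[] bytes) {
        InvertedUtf8Sketch sketch = new InvertedUtf8Sketch();
        for (byte b : bytes) {
            sketch.accept(b);
        }
        // Assumption of this sketch: a sequence cut off at end of input is not UTF-8.
        if (sketch.remainingContinuationBytes > 0) {
            return WINDOWS_1252;
        }
        return sketch.charset;
    }
}
```

Because the final `else` branch is the default, a byte value nobody thought about ends up rejected rather than accepted. Note that this sketch checks only lead-byte ranges and continuation counts; it does not apply the second-byte restrictions of RFC 3629 (for example after 0xE0, 0xED, or 0xF4), and the PR text does not say whether the merged code does.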

@github-project-automation github-project-automation Bot moved this from In Progress to Ready to Review in OpenRewrite Mar 31, 2026
Bytes in the range 0xF8-0xFF are never valid in any position of a UTF-8
sequence, but the detection logic had no branch for them — they fell
through without setting charset to Windows-1252. This caused ISO-8859-1
files containing characters like ü (0xFC) to be incorrectly detected as
UTF-8, leading to `mod git apply` failures when patch bytes didn't match
file bytes.

Also reject 0xC0 and 0xC1 as UTF-8 lead bytes — these would encode
code points below U+0080 (overlong encodings forbidden by RFC 3629).
@pstreef pstreef force-pushed the pstreef/fix-charset-detect branch from 68fcfd4 to c1b179f Compare March 31, 2026 13:14
Replace the enumeration-of-bad-bytes approach with a single
remainingContinuationBytes counter. Any byte not explicitly valid
in its position is rejected as non-UTF-8, which eliminates an
entire class of detection gaps by default.

Also tightens the 4-byte lead range to 0xF0-0xF4 (0xF5-0xF7
would encode code points above U+10FFFF).
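The two range claims in these commit messages follow from the UTF-8 bit layout (a two-byte lead carries 5 payload bits, a four-byte lead carries 3). The small sketch below works through the arithmetic; the class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical helper that computes code point bounds from a UTF-8 lead byte.
final class LeadByteRanges {
    /** Largest code point a two-byte sequence with this lead (110xxxxx) could encode. */
    static int maxTwoByteCodePoint(int lead) {
        // 5 payload bits from the lead, then 6 bits from the continuation, all set to 1
        return ((lead & 0x1F) << 6) | 0x3F;
    }

    /** Smallest code point a four-byte sequence with this lead (11110xxx) could encode. */
    static int minFourByteCodePoint(int lead) {
        // 3 payload bits from the lead, followed by 18 zero bits from three continuations
        return (lead & 0x07) << 18;
    }
}
```

For 0xC1 the two-byte maximum is 0x7F, so every sequence starting with 0xC0 or 0xC1 would be an overlong encoding of an ASCII code point. For 0xF5 the four-byte minimum is 0x140000, already above U+10FFFF, while 0xF4 starts at 0x100000 and therefore remains a valid lead.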
@pstreef pstreef changed the title Fix encoding detection for bytes 0xF8-0xFF Rewrite encoding detection as inverted UTF-8 validator Mar 31, 2026
@pstreef pstreef marked this pull request as draft March 31, 2026 13:18
@pstreef pstreef force-pushed the pstreef/fix-charset-detect branch 3 times, most recently from 03fdac0 to 4eb482e Compare March 31, 2026 13:32
Use CsvSource-based parameterized tests for byte-level validation,
hex-encoded byte arrays for precise control over test inputs, and
consolidate duplicated test patterns.
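As an illustration of what hex-encoded test inputs can look like, here is a small plain-Java helper that turns a string such as `"C3 BC"` into a byte array. The name `HexBytes.parse` is hypothetical; the PR's actual test helpers are not shown in this conversation, and a parameterized test would presumably pass such strings in as arguments.

```java
// Hypothetical test helper; the PR's real helper is not shown here.
final class HexBytes {
    /** Parses space-separated or contiguous hex pairs, e.g. "C3 BC" or "c3bc". */
    static byte[] parse(String hex) {
        String compact = hex.replace(" ", "");
        if (compact.length() % 2 != 0) {
            throw new IllegalArgumentException("Odd number of hex digits: " + hex);
        }
        byte[] out = new byte[compact.length() / 2];
        for (int i = 0; i < out.length; i++) {
            // Two hex digits per byte; the cast maps 0x80-0xFF onto negative byte values
            out[i] = (byte) Integer.parseInt(compact.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

Writing inputs this way keeps the exact bytes visible in the test source and avoids any dependence on the encoding of the test file itself.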
@pstreef pstreef force-pushed the pstreef/fix-charset-detect branch from 4eb482e to b3a491d Compare March 31, 2026 13:35
@pstreef pstreef marked this pull request as ready for review March 31, 2026 13:41
@pstreef pstreef requested a review from timtebeek March 31, 2026 13:41
Comment on lines +90 to +92
void detectsCharsetForEncodedStrings(String text, String sourceCharset, String expectedCharset) throws Exception {
    try (EncodingDetectingInputStream is = read(text, Charset.forName(sourceCharset))) {
        assertThat(is.getCharset()).isEqualTo(Charset.forName(expectedCharset));
Member

Source and expected charset always match? Do we then even need both?

Member

@sambsnyd sambsnyd left a comment

This looks logical

@pstreef pstreef merged commit bdb64b5 into main Apr 1, 2026
1 check passed
@pstreef pstreef deleted the pstreef/fix-charset-detect branch April 1, 2026 06:54
@github-project-automation github-project-automation Bot moved this from Ready to Review to Done in OpenRewrite Apr 1, 2026