Rewrite encoding detection as inverted UTF-8 validator #7213
Merged
Conversation
timtebeek approved these changes on Mar 31, 2026
Bytes in the range 0xF8-0xFF are never valid in any position of a UTF-8 sequence, but the detection logic had no branch for them — they fell through without setting charset to Windows-1252. This caused ISO-8859-1 files containing characters like ü (0xFC) to be incorrectly detected as UTF-8, leading to `mod git apply` failures when patch bytes didn't match file bytes. Also reject 0xC0 and 0xC1 as UTF-8 lead bytes — these would encode code points below U+0080 (overlong encodings forbidden by RFC 3629).
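To make the mismatch concrete, here is a small standalone illustration (not part of the PR): the same character has different byte representations in ISO-8859-1 and UTF-8, so a patch generated under the wrong charset guess will not match the file's bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Illustration only: "ü" is a single 0xFC byte in ISO-8859-1 but the two-byte
// sequence 0xC3 0xBC in UTF-8, so misdetecting the charset changes the bytes
// that `mod git apply` compares against the file on disk.
class CharsetMismatchDemo {
    public static void main(String[] args) {
        byte[] iso = "ü".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "ü".getBytes(StandardCharsets.UTF_8);
        System.out.println(HexFormat.of().formatHex(iso));  // fc
        System.out.println(HexFormat.of().formatHex(utf8)); // c3bc
    }
}
```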
Force-pushed from 68fcfd4 to c1b179f
Replace the enumeration-of-bad-bytes approach with a single remainingContinuationBytes counter. Any byte not explicitly valid in its position is rejected as non-UTF-8, which eliminates an entire class of detection gaps by default. Also tightens the 4-byte lead range to 0xF0-0xF4 (0xF5-0xF7 would encode code points above U+10FFFF).
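A minimal sketch of what such a counter-based validator can look like. The `remainingContinuationBytes` name comes from the commit message; the rest is an illustrative reconstruction, not the actual `EncodingDetectingInputStream` code, and it omits the finer overlong/surrogate checks a full validator would need.

```java
// Sketch: every byte must be explicitly valid in its position; anything
// unrecognized immediately means "not UTF-8" (i.e. fall back to Windows-1252).
class Utf8Sketch {
    static boolean looksLikeUtf8(byte[] bytes) {
        int remainingContinuationBytes = 0;
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (remainingContinuationBytes > 0) {
                if (v < 0x80 || v > 0xBF) {
                    return false;               // expected a 10xxxxxx continuation byte
                }
                remainingContinuationBytes--;
            } else if (v < 0x80) {
                // ASCII, always valid
            } else if (v >= 0xC2 && v <= 0xDF) {
                remainingContinuationBytes = 1; // 2-byte lead; 0xC0/0xC1 are overlong, rejected
            } else if (v >= 0xE0 && v <= 0xEF) {
                remainingContinuationBytes = 2; // 3-byte lead
            } else if (v >= 0xF0 && v <= 0xF4) {
                remainingContinuationBytes = 3; // 4-byte lead; 0xF5-0xF7 exceed U+10FFFF
            } else {
                return false;                   // stray 0x80-0xBF, 0xC0/0xC1, 0xF5-0xFF
            }
        }
        return remainingContinuationBytes == 0; // reject a truncated trailing sequence
    }
}
```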
Force-pushed from 03fdac0 to 4eb482e
Use CsvSource-based parameterized tests for byte-level validation, hex-encoded byte arrays for precise control over test inputs, and consolidate duplicated test patterns.
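The test shape this describes might look roughly like the following. The import path, constructor, the need to consume the stream via `readAllBytes()` before calling `getCharset()`, and the test/parameter names are all assumptions for illustration, not the PR's actual test code.

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.HexFormat;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;
import org.openrewrite.internal.EncodingDetectingInputStream;

class EncodingDetectionSketchTest {

    @ParameterizedTest
    @CsvSource({
        "c3bc, UTF-8",        // ü as UTF-8: a valid two-byte sequence
        "fc, windows-1252",   // ü as ISO-8859-1: a lone 0xFC is never valid UTF-8
        "f8, windows-1252",   // 0xF8-0xFF are invalid in any position
        "c0af, windows-1252"  // overlong encoding forbidden by RFC 3629
    })
    void detectsCharsetFromHexBytes(String hex, String expectedCharset) throws Exception {
        byte[] bytes = HexFormat.of().parseHex(hex);
        try (EncodingDetectingInputStream is =
                 new EncodingDetectingInputStream(new ByteArrayInputStream(bytes))) {
            is.readAllBytes(); // consume the stream so detection sees every byte
            assertThat(is.getCharset()).isEqualTo(Charset.forName(expectedCharset));
        }
    }
}
```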
Force-pushed from 4eb482e to b3a491d
kmccarp approved these changes on Mar 31, 2026
timtebeek reviewed on Mar 31, 2026
Comment on lines +90 to +92
void detectsCharsetForEncodedStrings(String text, String sourceCharset, String expectedCharset) throws Exception {
    try (EncodingDetectingInputStream is = read(text, Charset.forName(sourceCharset))) {
        assertThat(is.getCharset()).isEqualTo(Charset.forName(expectedCharset));
Source and expected charset always match? Do we then even need both?
Problem
`EncodingDetectingInputStream.guessCharset()` enumerated known-bad UTF-8 byte patterns, which meant any pattern not explicitly listed would slip through undetected. This caused ISO-8859-1 files (e.g. containing German umlauts like ü = 0xFC) to be misdetected as UTF-8, leading to `mod git apply` failures. Specific gaps: bytes 0xF8-0xFF fell through entirely, and 0xC0/0xC1 (overlong encodings forbidden by RFC 3629) were accepted as valid two-byte sequence leads.
Solution
Invert the detection logic: instead of listing invalid bytes, the new implementation tracks a `remainingContinuationBytes` counter and only accepts bytes that are explicitly valid in their position. Anything else triggers the Windows-1252 fallback. This is simpler (15 lines vs 30, one int field vs three booleans plus two prev-byte trackers), and eliminates detection gaps by default: unknown bytes are rejected rather than silently accepted.
Also tightens the 4-byte lead range from 0xF0-0xF7 to 0xF0-0xF4 (0xF5-0xF7 would encode above U+10FFFF).
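As a quick sanity check on that upper bound (illustrative arithmetic, not PR code): the smallest code point a 0xF5 lead byte could produce already exceeds U+10FFFF.

```java
// A 4-byte UTF-8 sequence contributes the lead byte's low 3 bits shifted left by 18.
// With all-zero continuation bytes, a 0xF5 lead starts at U+140000 > U+10FFFF,
// so 0xF5-0xF7 can never begin a valid sequence.
class LeadByteBoundCheck {
    public static void main(String[] args) {
        int lowestFor0xF5 = (0xF5 & 0x07) << 18;
        System.out.printf("U+%X vs max U+%X%n", lowestFor0xF5, Character.MAX_CODE_POINT);
        // prints: U+140000 vs max U+10FFFF
    }
}
```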