Rewrite encoding detection as inverted UTF-8 validator #7213
Merged
Conversation
timtebeek approved these changes on Mar 31, 2026
Bytes in the range 0xF8-0xFF are never valid in any position of a UTF-8 sequence, but the detection logic had no branch for them — they fell through without setting charset to Windows-1252. This caused ISO-8859-1 files containing characters like ü (0xFC) to be incorrectly detected as UTF-8, leading to `mod git apply` failures when patch bytes didn't match file bytes. Also reject 0xC0 and 0xC1 as UTF-8 lead bytes — these would encode code points below U+0080 (overlong encodings forbidden by RFC 3629).
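To make the mismatch concrete, here is a small standalone illustration (not part of the PR): the same character has different byte representations in ISO-8859-1 and UTF-8, so a patch generated under the wrong charset guess will not match the file's bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

// Illustration only: "ü" is a single 0xFC byte in ISO-8859-1 but the two-byte
// sequence 0xC3 0xBC in UTF-8, so misdetecting the charset changes the bytes
// that `mod git apply` compares against the file on disk.
class CharsetMismatchDemo {
    public static void main(String[] args) {
        byte[] iso = "ü".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "ü".getBytes(StandardCharsets.UTF_8);
        System.out.println(HexFormat.of().formatHex(iso));  // fc
        System.out.println(HexFormat.of().formatHex(utf8)); // c3bc
    }
}
```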
Force-pushed from 68fcfd4 to c1b179f
Replace the enumeration-of-bad-bytes approach with a single remainingContinuationBytes counter. Any byte not explicitly valid in its position is rejected as non-UTF-8, which eliminates an entire class of detection gaps by default. Also tightens the 4-byte lead range to 0xF0-0xF4 (0xF5-0xF7 would encode code points above U+10FFFF).
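A minimal sketch of what such a counter-based validator can look like. The `remainingContinuationBytes` name comes from the commit message; the rest is an illustrative reconstruction, not the actual `EncodingDetectingInputStream` code, and it omits the finer overlong/surrogate checks a full validator would need.

```java
// Sketch: every byte must be explicitly valid in its position; anything
// unrecognized immediately means "not UTF-8" (i.e. fall back to Windows-1252).
class Utf8Sketch {
    static boolean looksLikeUtf8(byte[] bytes) {
        int remainingContinuationBytes = 0;
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (remainingContinuationBytes > 0) {
                if (v < 0x80 || v > 0xBF) {
                    return false;               // expected a 10xxxxxx continuation byte
                }
                remainingContinuationBytes--;
            } else if (v < 0x80) {
                // ASCII, always valid
            } else if (v >= 0xC2 && v <= 0xDF) {
                remainingContinuationBytes = 1; // 2-byte lead; 0xC0/0xC1 are overlong, rejected
            } else if (v >= 0xE0 && v <= 0xEF) {
                remainingContinuationBytes = 2; // 3-byte lead
            } else if (v >= 0xF0 && v <= 0xF4) {
                remainingContinuationBytes = 3; // 4-byte lead; 0xF5-0xF7 exceed U+10FFFF
            } else {
                return false;                   // stray 0x80-0xBF, 0xC0/0xC1, 0xF5-0xFF
            }
        }
        return remainingContinuationBytes == 0; // reject a truncated trailing sequence
    }
}
```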
Force-pushed from 03fdac0 to 4eb482e
Use CsvSource-based parameterized tests for byte-level validation, hex-encoded byte arrays for precise control over test inputs, and consolidate duplicated test patterns.
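The test shape this describes might look roughly like the following. The import path, constructor, the need to consume the stream via `readAllBytes()` before calling `getCharset()`, and the test/parameter names are all assumptions for illustration, not the PR's actual test code.

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.util.HexFormat;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;
import org.openrewrite.internal.EncodingDetectingInputStream;

class EncodingDetectionSketchTest {

    @ParameterizedTest
    @CsvSource({
        "c3bc, UTF-8",        // ü as UTF-8: a valid two-byte sequence
        "fc, windows-1252",   // ü as ISO-8859-1: a lone 0xFC is never valid UTF-8
        "f8, windows-1252",   // 0xF8-0xFF are invalid in any position
        "c0af, windows-1252"  // overlong encoding forbidden by RFC 3629
    })
    void detectsCharsetFromHexBytes(String hex, String expectedCharset) throws Exception {
        byte[] bytes = HexFormat.of().parseHex(hex);
        try (EncodingDetectingInputStream is =
                 new EncodingDetectingInputStream(new ByteArrayInputStream(bytes))) {
            is.readAllBytes(); // consume the stream so detection sees every byte
            assertThat(is.getCharset()).isEqualTo(Charset.forName(expectedCharset));
        }
    }
}
```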
Force-pushed from 4eb482e to b3a491d
kmccarp approved these changes on Mar 31, 2026
timtebeek reviewed on Mar 31, 2026
Comment on lines +90 to +92
void detectsCharsetForEncodedStrings(String text, String sourceCharset, String expectedCharset) throws Exception {
    try (EncodingDetectingInputStream is = read(text, Charset.forName(sourceCharset))) {
        assertThat(is.getCharset()).isEqualTo(Charset.forName(expectedCharset));
Source and expected charset always match? Do we then even need both?
Problem
`EncodingDetectingInputStream.guessCharset()` enumerated known-bad UTF-8 byte patterns, which meant any pattern not explicitly listed would slip through undetected. This caused ISO-8859-1 files (e.g. containing German umlauts like ü = 0xFC) to be misdetected as UTF-8, leading to `mod git apply` failures. Specific gaps: bytes 0xF8-0xFF fell through entirely, and 0xC0/0xC1 (overlong encodings forbidden by RFC 3629) were accepted as valid two-byte sequence leads.
Solution
Invert the detection logic: instead of listing invalid bytes, the new implementation tracks a `remainingContinuationBytes` counter and only accepts bytes that are explicitly valid in their position. Anything else triggers the Windows-1252 fallback. This is simpler (15 lines vs 30, one int field vs three booleans plus two prev-byte trackers), and eliminates detection gaps by default: unknown bytes are rejected rather than silently accepted.
Also tightens the 4-byte lead range from 0xF0-0xF7 to 0xF0-0xF4 (0xF5-0xF7 would encode above U+10FFFF).
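As a quick sanity check on that upper bound (illustrative arithmetic, not PR code): the smallest code point a 0xF5 lead byte could produce already exceeds U+10FFFF.

```java
// A 4-byte UTF-8 sequence contributes the lead byte's low 3 bits shifted left by 18.
// With all-zero continuation bytes, a 0xF5 lead starts at U+140000 > U+10FFFF,
// so 0xF5-0xF7 can never begin a valid sequence.
class LeadByteBoundCheck {
    public static void main(String[] args) {
        int lowestFor0xF5 = (0xF5 & 0x07) << 18;
        System.out.printf("U+%X vs max U+%X%n", lowestFor0xF5, Character.MAX_CODE_POINT);
        // prints: U+140000 vs max U+10FFFF
    }
}
```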