Fix host validation: IPv6 zone ID characters and NFKC percent bypass #1655
Open
rodrigobnogueira wants to merge 4 commits into aio-libs:master
Finding 1: IPv6 zone IDs were not validated even when validate_host=True. Any character, including CR, LF, and null bytes, could be embedded in url.host via URL.build(host='::1%<bad>'). This creates an asymmetry: regular hostnames are correctly rejected for control characters, but zone IDs were passed through verbatim.

Fix: add _ZONE_ID_RE regex (RFC 6874 unreserved + sub-delims) and validate the zone portion of IPv6 addresses in _encode_host() when validate_host=True.

Finding 2: _check_netloc() normalizes the netloc via NFKC and checks for URL-reserved characters, but '%' was missing from the checked set. U+FF05 (FULLWIDTH PERCENT SIGN) and U+FE6A (SMALL PERCENT SIGN) both normalize to '%' under NFKC and were accepted, ultimately producing a literal '%' in url.host via the stdlib IDNA fallback in _idna_encode().

Fix: add '%' to the character set checked in _check_netloc().
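The NFKC behaviour behind Finding 2 can be reproduced with the standard library alone. The following is a minimal sketch, not yarl's actual `_check_netloc()`: the function name `check_netloc_sketch` and the constant `RESERVED_AFTER_NFKC` are hypothetical, and the logic is simplified to "reject any reserved character that only appears after normalization".

```python
import unicodedata

# Hypothetical reserved set for illustration; the trailing '%' is the
# addition described in Finding 2.
RESERVED_AFTER_NFKC = "/?#@:%"


def check_netloc_sketch(netloc: str) -> None:
    """Raise ValueError if NFKC normalization introduces a reserved character."""
    if netloc.isascii():
        return  # Pure ASCII text is unchanged by NFKC.
    normalized = unicodedata.normalize("NFKC", netloc)
    if normalized == netloc:
        return  # Normalization changed nothing, so nothing was smuggled in.
    for ch in RESERVED_AFTER_NFKC:
        # Only flag characters that were absent before normalization.
        if ch in normalized and ch not in netloc:
            raise ValueError(
                f"netloc {netloc!r} contains invalid character {ch!r} "
                "under NFKC normalization"
            )


# Both compatibility forms collapse to the ASCII percent sign.
print(unicodedata.normalize("NFKC", "\uff05") == "%")
print(unicodedata.normalize("NFKC", "\ufe6a") == "%")
```

Without `%` in the reserved set, a host such as `'a\uff05b.example'` would pass this kind of check even though its normalized form contains a literal `%`.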
Codecov Report

✅ All modified and coverable lines are covered by tests.
❌ Your project check has failed because the head coverage (97.65%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

Coverage Diff

| | master | #1655 | +/- |
|---|---|---|---|
| Coverage | 99.47% | 99.48% | |
| Files | 30 | 30 | |
| Lines | 5942 | 5993 | +51 |
| Branches | 283 | 285 | +2 |
| Hits | 5911 | 5962 | +51 |
| Misses | 22 | 22 | |
| Partials | 9 | 9 | |
Dreamsorcerer approved these changes on Apr 13, 2026
- Remove unused 'desc' parameter from zone ID test parametrize tuple
- Update _ZONE_ID_RE comment: cite RFC 9844 (which obsoletes RFC 6874 for UI usage) and add a direct link to RFC 6874 §2 for the ZoneID ABNF grammar (unreserved / pct-encoded)
The _ZONE_ID_RE allowlist was based on RFC 6874's ABNF grammar, which was overly restrictive. RFC 4007 §11.2 specifies that zone IDs are OS-defined text strings with no format restriction (interface names like 'eth0', 'Ethernet (LAN)', and numeric indices are all valid). RFC 9844 §6.3 recommends rejecting characters inappropriate for the environment. For yarl this means ASCII control characters (CTL).

Changes:
- Replace _ZONE_ID_RE with _ZONE_ID_UNSAFE_RE, which rejects CTL chars
- Keep the empty-zone check (::1% is still invalid)
- Update tests: remove 'spaces' from invalid cases, add valid cases
- Update changelog to cite RFC 9844 §6.3
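A rough sketch of the denylist approach described in this commit, using only the standard library. It is an illustration under stated assumptions, not the code in `yarl/_url.py`: the names `ZONE_ID_UNSAFE_RE` and `validate_ipv6_host_sketch` are hypothetical, and the address is split on the first `%` before being handed to `ipaddress`.

```python
import ipaddress
import re

# ASCII control characters (CTL): 0x00-0x1F and 0x7F.
ZONE_ID_UNSAFE_RE = re.compile(r"[\x00-\x1f\x7f]")


def validate_ipv6_host_sketch(host: str) -> str:
    """Validate an IPv6 literal with an optional zone ID; return it unchanged."""
    ip, sep, zone = host.partition("%")
    # Raises ipaddress.AddressValueError (a ValueError) for a bad address.
    ipaddress.IPv6Address(ip)
    if sep:
        if not zone:
            raise ValueError("empty zone ID")
        if ZONE_ID_UNSAFE_RE.search(zone):
            raise ValueError("zone ID contains ASCII control characters")
    return host


print(validate_ipv6_host_sketch("::1%eth0"))
print(validate_ipv6_host_sketch("fe80::1%Ethernet (LAN)"))
```

Spaces, parentheses, and non-ASCII text pass because only CTL characters are rejected; inputs such as `'::1%'` or `'::1%eth0\r\n'` raise `ValueError`.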
What do these changes do?

Fixes two incomplete validation cases in host parsing.

Finding 1: IPv6 zone ID character validation

The bug: `URL.build()` with `validate_host=True` (the default) did not validate the zone ID portion of IPv6 addresses. Any character, including CR, LF, and null bytes, could be embedded in `url.host` via `URL.build(host='::1%<bad>')`. This creates an asymmetry: regular hostnames are correctly rejected for control characters, but zone IDs were passed through verbatim.

Root cause: `_encode_host()` validates the IP address portion via `ip_address()`, but the zone string was concatenated back verbatim without any character inspection.

Fix: Added `_ZONE_ID_UNSAFE_RE` to reject ASCII control characters (CTL, `\x00`–`\x1f` and `\x7f`) in the zone ID. Per RFC 4007 §11.2, zone IDs are OS-specific text strings with no defined format. RFC 9844 §6.3 recommends rejecting characters inappropriate for the environment; for yarl we reject ASCII control characters. All other characters (spaces, Unicode, parentheses, etc.) are accepted.

Finding 2: NFKC fullwidth/small percent sign bypass

The bug: `_check_netloc()` normalizes the netloc via NFKC and checks for URL-reserved characters, but `%` was missing from the checked set. U+FF05 (FULLWIDTH PERCENT SIGN, ％) and U+FE6A (SMALL PERCENT SIGN, ﹪) both normalize to `%` under NFKC, so they passed validation and ultimately produced a literal `%` in `url.host` via the standard library IDNA fallback in `_idna_encode()`.

Root cause: The `idna` library correctly rejects `%` as invalid in a hostname label, but the `except UnicodeError` fallback to the standard library `host.encode('idna')` does its own NFKC normalization and silently accepts the character.

Fix: Add `%` to the character set checked after NFKC normalization in `_check_netloc()`.

Changes

| File | Change |
|---|---|
| `yarl/_url.py` | Add `_ZONE_ID_UNSAFE_RE` regex; validate zone ID in `_encode_host()` by rejecting CTL characters |
| `yarl/_parse.py` | Add `%` to NFKC character check in `_check_netloc()` |
| `tests/test_url_build.py` | |
| `tests/test_url.py` | |