Fix host validation: IPv6 zone ID characters and NFKC percent bypass#1655

Open
rodrigobnogueira wants to merge 4 commits into aio-libs:master from rodrigobnogueira:fix-host-validation

rodrigobnogueira (Member) commented Apr 12, 2026

What do these changes do?

Fixes two incomplete validation cases in host parsing.


Finding 1 — IPv6 zone ID character validation

The bug: URL.build() with validate_host=True (the default) did not validate the zone ID portion of IPv6 addresses. Any character — including CR, LF, null bytes — could be embedded in url.host via URL.build(host='::1%<bad>'). This creates an asymmetry:

from yarl import URL

# Correctly rejected:
URL.build(scheme='http', host='example.com\r\nX-Injected: evil')
# ValueError: Host '...' cannot contain '\r' ...

# Was NOT rejected before this fix (bug):
url = URL.build(scheme='http', host='::1%\r\nX-Injected: evil', path='/')
# url.host == '::1%\r\nX-Injected: evil'

Root cause: _encode_host() validated the IP address portion via ip_address(), but concatenated the zone string back verbatim without inspecting its characters.

Fix: Added _ZONE_ID_UNSAFE_RE to reject ASCII control characters (CTL, \x00–\x1f and \x7f) in the zone ID. Per RFC 4007 §11.2, zone IDs are OS-specific text strings with no defined format. RFC 9844 §6.3 recommends rejecting characters inappropriate for the environment; for yarl we reject ASCII control characters. All other characters — spaces, Unicode, parentheses, etc. — are accepted.
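
The rule described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the helper name `validate_ipv6_host`, the constant name, and the error messages are hypothetical, and the pattern is written from the description (`\x00`–`\x1f` and `\x7f`).

```python
import re
from ipaddress import ip_address

# Assumption: pattern reconstructed from the PR description
# (ASCII control characters \x00-\x1f and \x7f); the name is illustrative.
ZONE_ID_UNSAFE_RE = re.compile(r"[\x00-\x1f\x7f]")


def validate_ipv6_host(host: str) -> str:
    """Hypothetical helper: validate 'address%zone' and return it normalized."""
    raw_ip, sep, zone = host.partition("%")
    ip = ip_address(raw_ip)  # raises ValueError for a malformed address
    if not sep:
        return ip.compressed
    if not zone:
        raise ValueError(f"Empty zone ID in host {host!r}")
    if ZONE_ID_UNSAFE_RE.search(zone):
        raise ValueError(f"Zone ID in host {host!r} contains a control character")
    # The zone is opaque, OS-defined text, so it is kept verbatim once it
    # is known to be free of control characters.
    return f"{ip.compressed}%{zone}"
```

With this sketch, `validate_ipv6_host("fe80::1%Ethernet (LAN)")` passes the zone through unchanged, while `"::1%\r\nX-Injected: evil"`, `"::1%\x00"`, and the empty zone `"::1%"` all raise `ValueError`. A denylist is used rather than an allowlist because the zone ID format is left to the operating system.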


Finding 2 — NFKC fullwidth/small percent sign bypass

The bug: _check_netloc() normalizes the netloc via NFKC and checks for URL-reserved characters, but % was missing from the checked set. U+FF05 (FULLWIDTH PERCENT SIGN) and U+FE6A (SMALL PERCENT SIGN) both normalize to % under NFKC, so they passed validation and ultimately produced a literal % in url.host via the standard library IDNA fallback in _idna_encode():

URL('http://evil.com\uff052e.internal/').host
# 'evil.com%2e.internal'   ← literal % injected
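
The normalization step itself can be checked with the standard library alone. The snippet below is only an illustration of the NFKC behavior and does not involve yarl; the variable names are arbitrary.

```python
import unicodedata

# U+FF05 FULLWIDTH PERCENT SIGN and U+FE6A SMALL PERCENT SIGN
for ch in ("\uff05", "\ufe6a"):
    normalized = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {normalized!r}")

# The host from the example above, normalized the same way.
normalized_host = unicodedata.normalize("NFKC", "evil.com\uff052e.internal")
print(normalized_host)
```

Both code points fold to the ASCII percent sign, so the normalized host contains the sequence `%2e`.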

Root cause: The idna library correctly rejects % as invalid in a hostname label, but the except UnicodeError fallback to standard library host.encode('idna') does its own NFKC normalization and silently accepts the character.
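
The fallback behavior can be observed in isolation with the built-in `idna` codec. This is a small sketch of the standard library side only; it does not reproduce yarl's `_idna_encode()` and does not exercise the third-party idna package.

```python
# The built-in codec applies nameprep (which includes NFKC) to each label
# and does not enforce the stricter hostname character rules.
host = "evil.com\uff052e.internal"
encoded = host.encode("idna")
print(encoded)
```

The encoded bytes contain an ASCII `%`, which is how the character reached `url.host` once the stricter encoder had raised and the fallback took over.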

Fix: Add % to the character set checked after NFKC normalization in _check_netloc().
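
A check of this kind might look roughly like the sketch below. It is modeled on the approach of CPython's `urllib.parse` netloc check rather than copied from yarl: the function name `check_netloc`, the exact delimiter set, and the decision to ignore a literal `%` that is already present are all assumptions made for illustration.

```python
import unicodedata

# Assumption: delimiter set in the style of CPython's urllib.parse check,
# with '%' added as this PR describes; yarl's actual set may differ.
_NFKC_FORBIDDEN = "/?#@:%"


def check_netloc(netloc: str) -> None:
    """Hypothetical sketch: reject netlocs that gain delimiters under NFKC."""
    if netloc.isascii():
        return  # NFKC cannot change pure-ASCII text
    # Drop literal delimiters that are already present, so only characters
    # *introduced* by normalization are flagged.
    stripped = netloc
    for ch in "@:#?%":
        stripped = stripped.replace(ch, "")
    normalized = unicodedata.normalize("NFKC", stripped)
    if normalized == stripped:
        return
    for ch in _NFKC_FORBIDDEN:
        if ch in normalized:
            raise ValueError(
                f"netloc {netloc!r} contains invalid characters "
                "under NFKC normalization"
            )
```

Under these assumptions, `check_netloc("evil.com\uff052e.internal")` raises `ValueError`, while ordinary ASCII and precomposed non-ASCII hosts pass unchanged.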


Changes

| File | Change |
| --- | --- |
| yarl/_url.py | Add _ZONE_ID_UNSAFE_RE regex; validate zone ID in _encode_host() by rejecting CTL characters |
| yarl/_parse.py | Add % to NFKC character check in _check_netloc() |
| tests/test_url_build.py | Tests for zone ID CTL rejection (CRLF, null byte) and acceptance of non-CTL (spaces, Unicode, parens) |
| tests/test_url.py | Tests for U+FF05 and U+FE6A being rejected |

@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Apr 12, 2026
codspeed-hq bot commented Apr 12, 2026

Merging this PR will not alter performance

✅ 99 untouched benchmarks


Comparing rodrigobnogueira:fix-host-validation (eb19850) with master (3ca1a90)


Finding 1: IPv6 zone IDs were not validated even when validate_host=True.
Any character — including CR, LF, and null bytes — could be embedded in
url.host via URL.build(host='::1%<bad>'). This creates an asymmetry: regular
hostnames are correctly rejected for control characters but zone IDs were
passed through verbatim.

Fix: add _ZONE_ID_RE regex (RFC 6874 unreserved + sub-delims) and validate
the zone portion of IPv6 addresses in _encode_host() when validate_host=True.

Finding 2: _check_netloc() normalizes the netloc via NFKC and checks for
URL-reserved characters but '%' was missing from the checked set. U+FF05
(FULLWIDTH PERCENT SIGN) and U+FE6A (SMALL PERCENT SIGN) both normalize to
'%' under NFKC and were accepted, ultimately producing a literal '%' in
url.host via the stdlib IDNA fallback in _idna_encode().

Fix: add '%' to the character set checked in _check_netloc().

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.48%. Comparing base (2f180d1) to head (eb19850).
⚠️ Report is 1 commits behind head on master.

❌ Your project check has failed because the head coverage (97.65%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1655   +/-   ##
=======================================
  Coverage   99.47%   99.48%           
=======================================
  Files          30       30           
  Lines        5942     5993   +51     
  Branches      283      285    +2     
=======================================
+ Hits         5911     5962   +51     
  Misses         22       22           
  Partials        9        9           
| Flag | Coverage | Δ |
| --- | --- | --- |
| CI-GHA | 99.48% <100.00%> | (+<0.01%) ⬆️ |
| MyPy | 97.65% <100.00%> | (+0.02%) ⬆️ |
| OS-Linux | 99.71% <100.00%> | (+<0.01%) ⬆️ |
| OS-Windows | 98.43% <100.00%> | (+0.01%) ⬆️ |
| OS-macOS | 98.60% <100.00%> | (+0.03%) ⬆️ |
| Py-3.10.11 | 98.40% <100.00%> | (+0.01%) ⬆️ |
| Py-3.10.20 | 99.63% <100.00%> | (+<0.01%) ⬆️ |
| Py-3.11.15 | 99.63% <100.00%> | (+<0.01%) ⬆️ |
| Py-3.11.9 | 98.40% <100.00%> | (+0.01%) ⬆️ |
| Py-3.12.10 | 98.40% <100.00%> | (+0.01%) ⬆️ |
| Py-3.12.13 | 99.63% <100.00%> | (+<0.01%) ⬆️ |
| Py-3.13.12 | ? | |
| Py-3.13.13 | 99.71% <100.00%> | (?) |
| Py-3.13.13t | 99.71% <100.00%> | (+0.02%) ⬆️ |
| Py-3.14.3 | ? | |
| Py-3.14.4 | 99.70% <100.00%> | (?) |
| Py-3.14.4t | 99.70% <100.00%> | (+0.02%) ⬆️ |
| Py-pypy3.10.16-7.3.19 | ? | |
| VM-macos-latest | 98.60% <100.00%> | (+0.03%) ⬆️ |
| VM-ubuntu-latest | 99.71% <100.00%> | (+<0.01%) ⬆️ |
| VM-windows-latest | 98.43% <100.00%> | (+0.01%) ⬆️ |
| pytest | 99.73% <100.00%> | (+<0.01%) ⬆️ |


Comment thread tests/test_url_build.py Outdated
Comment thread yarl/_url.py Outdated
- Remove unused 'desc' parameter from zone ID test parametrize tuple
- Update _ZONE_ID_RE comment: cite RFC 9844 (which obsoletes RFC 6874
  for UI usage) and add a direct link to RFC 6874 §2 for the ZoneID
  ABNF grammar (unreserved / pct-encoded)
Comment thread yarl/_url.py Outdated
The _ZONE_ID_RE allowlist was based on RFC 6874's ABNF grammar, which
was overly restrictive. RFC 4007 §11.2 specifies that zone IDs are
OS-defined text strings with no format restriction (interface names like
'eth0', 'Ethernet (LAN)', and numeric indices are all valid).

RFC 9844 §6.3 recommends rejecting characters inappropriate for the
environment. For yarl this means ASCII control characters (CTL).

Changes:
- Replace _ZONE_ID_RE with _ZONE_ID_UNSAFE_RE that rejects CTL chars
- Keep the empty-zone check (::1% is still invalid)
- Update tests: remove 'spaces' from invalid cases, add valid cases
- Update changelog to cite RFC 9844 §6.3