fix: percent-encode username in URL to handle non-ASCII characters (fixes UnicodeDecodeError) #2848

Open
juliosuas wants to merge 12 commits into sherlock-project:master from juliosuas:fix/unicode-url-encoding-special-chars

Conversation

@juliosuas

Description

Fixes #2730 — Sherlock crashes with UnicodeDecodeError when searching for usernames containing non-ASCII characters (accented letters, Cyrillic, CJK, etc.).

Root Cause

In sherlock(), the username is inserted into the URL template with only a manual space→%20 replacement:

url = interpolate_string(net_info["url"], username.replace(' ', '%20'))

When the username contains non-ASCII characters like 'É', the resulting URL is not properly encoded (e.g. https://example.com/user/Émile). If the server responds with a redirect, the Location header may contain the raw non-ASCII character in an encoding requests cannot decode, causing:

UnicodeDecodeError: 'utf-8' codec can't decode byte...
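
The failure mode can be reproduced without any network access. The sketch below is illustrative only: it assumes the server echoes the username into the Location header as Latin-1 bytes (the description does not say which encoding is involved), and it assumes a percent-encoded request would be echoed back percent-encoded. The helper name decodes_as_utf8 is hypothetical.

```python
from urllib.parse import quote

BASE = "https://example.com/user/"

# Assumption: the server writes the raw username into Location as Latin-1.
raw_location = (BASE + "Émile").encode("latin-1")

# Assumption: a percent-encoded request is echoed back percent-encoded,
# so the header contains only ASCII bytes.
quoted_location = (BASE + quote("Émile", safe="/")).encode("latin-1")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the header bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(raw_location))     # False: 0xC9 is not valid UTF-8 here
print(decodes_as_utf8(quoted_location))  # True: pure ASCII
```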

Fix

Replace the manual replacement with urllib.parse.quote(username, safe='/'), which:

  • Percent-encodes every character outside the unreserved set (ASCII letters, digits, and _.-~), including non-ASCII characters, which are encoded as UTF-8 bytes first
  • Preserves forward slashes (safe='/', which is also quote's default) for sites that use path segments
  • Handles the space case as well (quote encodes spaces as %20; quote_plus would use + instead)

from urllib.parse import quote as url_quote
url = interpolate_string(net_info["url"], url_quote(username, safe='/'))
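
For comparison, here is a small standalone sketch of the old and new encodings side by side. The names encode_old, encode_new, and fill_template are hypothetical; fill_template is only a stand-in for interpolate_string and assumes the template marks the username slot with {}.

```python
from urllib.parse import quote


def encode_old(username: str) -> str:
    """Previous behaviour: only spaces are escaped."""
    return username.replace(" ", "%20")


def encode_new(username: str) -> str:
    """Proposed behaviour: percent-encode everything except unreserved chars and '/'."""
    return quote(username, safe="/")


def fill_template(template: str, username: str) -> str:
    """Hypothetical stand-in for interpolate_string: substitute the {} placeholder."""
    return template.replace("{}", username)


template = "https://example.com/user/{}"
for name in ("Émile", "john doe", "a/b"):
    print(fill_template(template, encode_old(name)))
    print(fill_template(template, encode_new(name)))
```

With these inputs, encode_new yields %C3%89mile, john%20doe, and a/b, while encode_old leaves Émile untouched.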

Before

sherlock 'Émile'
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3...

After

Sherlock scans the username cleanly without crashing.

Checklist

  • Syntax verified with ast.parse()
  • Existing test suite passes
  • urllib.parse is part of the Python standard library (no new dependencies)

GNOME VCS (sherlock-project#2804): Switch from response_url to API-based detection
using /api/v4/users?username={} endpoint (same approach as GitLab).
The previous response_url method failed because non-existent users
get 302-redirected to /users/sign_in instead of staying at the
profile URL.

Patched (sherlock-project#2805): Update domain from patched.sh to patched.to.
The site migrated domains, causing all lookups to fail with the
old URL.

Verified both fixes: GNOME VCS API returns user data for existing
users and [] for non-existent ones. Patched.to returns the expected
error message for invalid users on the new domain.
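
The API-based check described above can be sketched as a pure function over the response body. This is a minimal illustration, not the project's detection code: user_exists is a hypothetical name, and the sample payload is made up (only the empty-list case for missing users comes from the verification note above).

```python
import json


def user_exists(response_body: str) -> bool:
    """Treat a non-empty JSON list as 'user found', an empty list as 'not found'."""
    payload = json.loads(response_body)
    return isinstance(payload, list) and len(payload) > 0


# Illustrative payloads; the real API returns more fields per user.
print(user_exists('[{"username": "someuser"}]'))
print(user_exists("[]"))
```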

Fixes sherlock-project#2804
Fixes sherlock-project#2805

Patched.sh now redirects to patched.to, which returns HTTP 403
for all requests (both existing and non-existing users) due to
Cloudflare WAF. This causes guaranteed false positives.

GNOME VCS fix (API-based detection) is retained and passes F+/F-
validation locally.

Usernames containing non-ASCII characters (e.g. accented letters like
'Émile', Cyrillic, CJK, etc.) were being inserted verbatim into URLs,
causing UnicodeDecodeError crashes when a server responded with a
redirect whose Location header could not be decoded.

Replace the manual '.replace(" ", "%20")' with
'urllib.parse.quote(username, safe="/")' which correctly
percent-encodes all non-URL-safe characters, including non-ASCII ones,
while preserving forward-slashes used in path-based profile URLs.

Fixes sherlock-project#2730

@github-actions
Contributor

Automatic validation of changes

Target    F+ Check    F- Check
Root-Me   ✔️ Pass     ✔️ Pass

@juliosuas
Author

Thanks — CI validation passes. This fix resolves #2730 by replacing the bare username.replace(' ', '%20') with urllib.parse.quote(username, safe='/'), which correctly percent-encodes all non-URL-safe characters including accented letters (é, ñ, ü, etc.) that caused the UnicodeDecodeError when servers returned redirect headers containing those bytes.

urllib.parse is part of the standard library — no new dependencies added.

Development

Successfully merging this pull request may close these issues.

Crash: UnicodeDecodeError on usernames with special characters
