fix: percent-encode username in URL to handle non-ASCII characters (fixes UnicodeDecodeError)#2848
Open
juliosuas wants to merge 12 commits intosherlock-project:masterfrom
Open
Conversation
GNOME VCS (sherlock-project#2804): Switch from response_url to API-based detection using /api/v4/users?username={} endpoint (same approach as GitLab). The previous response_url method failed because non-existent users get 302-redirected to /users/sign_in instead of staying at the profile URL. Patched (sherlock-project#2805): Update domain from patched.sh to patched.to. The site migrated domains, causing all lookups to fail with the old URL. Verified both fixes: GNOME VCS API returns user data for existing users and [] for non-existent ones. Patched.to returns the expected error message for invalid users on the new domain. Fixes sherlock-project#2804 Fixes sherlock-project#2805
Patched.sh now redirects to patched.to, which returns HTTP 403 for all requests (both existing and non-existing users) due to Cloudflare WAF. This causes guaranteed false positives. GNOME VCS fix (API-based detection) is retained and passes F+/F- validation locally.
Usernames containing non-ASCII characters (e.g. accented letters like
'Émile', Cyrillic, CJK, etc.) were being inserted verbatim into URLs,
causing UnicodeDecodeError crashes when a server responded with a
redirect whose Location header could not be decoded.
Replace the manual '.replace(" ", "%20")' with
'urllib.parse.quote(username, safe="/")' which correctly
percent-encodes all non-URL-safe characters, including non-ASCII ones,
while preserving forward-slashes used in path-based profile URLs.
Fixes sherlock-project#2730
Contributor
Automatic validation of changes
|
Author
|
Thanks — CI validation passes. This fix resolves #2730 by replacing the bare
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes #2730 — Sherlock crashes with
UnicodeDecodeErrorwhen searching for usernames containing non-ASCII characters (accented letters, Cyrillic, CJK, etc.).Root Cause
In
sherlock(), the username is inserted into the URL template with only a manual space→%20replacement:When the username contains non-ASCII characters like
'É', the resulting URL is not properly encoded (e.g.https://example.com/user/Émile). If the server responds with a redirect, theLocationheader may contain the raw non-ASCII character in an encodingrequestscannot decode, causing:Fix
Replace the manual replacement with
urllib.parse.quote(username, safe='/')which:safe='/') for sites that use path segmentsquoteencodes spaces as%20by default)Before
After
Sherlock scans the username cleanly without crashing.
Checklist
ast.parse()urllib.parseis part of the Python standard library (no new dependencies)