
PETSc downloads broken because the site only seems to accept certain User-Agent values in the header #4925

@casparvl

Issue

$ eb --fetch PETSc-3.20.3-foss-2023a.eb --force-download
Couldn't find file petsc-3.20.3.tar.gz anywhere, and downloading it didn't work either... Paths attempted (in order): ...
/home/casparl/.local/easybuild/sources/petsc-3.20.3.tar.gz, https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.20.3.tar.gz, https://sources.easybuild.io/p/PETSc/petsc-3.20.3.tar.gz  (took 0 secs)

In the logs:

== 2025-06-17 11:24:31,727 filetools.py:903 INFO Attempt 1 of downloading https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.20.3.tar.gz to /home/casparl/.local/easybuild/sources/p/PETSc/petsc-3.20.3.tar.gz failed, trying again...
== 2025-06-17 11:24:31,727 filetools.py:908 INFO Downloading using requests package instead of urllib2
== 2025-06-17 11:24:32,037 filetools.py:886 WARNING URL https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.20.3.tar.gz was not found (HTTP response code 403), not trying again
== 2025-06-17 11:24:32,037 filetools.py:923 WARNING Download of https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.20.3.tar.gz to /home/casparl/.local/easybuild/sources/p/PETSc/petsc-3.20.3.tar.gz failed, done trying
== 2025-06-17 11:24:32,105 filetools.py:886 WARNING URL https://sources.easybuild.io/p/PETSc/petsc-3.20.3.tar.gz was not found (HTTP response code 404), not trying again
== 2025-06-17 11:24:32,106 filetools.py:923 WARNING Download of https://sources.easybuild.io/p/PETSc/petsc-3.20.3.tar.gz to /home/casparl/.local/easybuild/sources/p/PETSc/petsc-3.20.3.tar.gz failed, done trying

Cause

In EasyBuild's filetools.py, the download uses a custom HTTP header:

    # use custom HTTP header
    headers = {'User-Agent': 'EasyBuild', 'Accept': '*/*'}

Trying this interactively, I see:

>>> import requests
>>> url = "https://web.cels.anl.gov/projects/petsc/download/release-snapshots/petsc-3.20.3.tar.gz"
>>> headers = {
...     "User-Agent": "EasyBuild",
...     "Accept": "*/*",
... }
>>> response = requests.get(url, headers=headers)
>>> response.status_code
403

However, when we pretend to be Wget:

>>> headers = {
...     "User-Agent": "Wget/1.21.1",
...     "Accept": "*/*",
... }
>>> response = requests.get(url, headers=headers)
>>> response.status_code
200

Some sites only accept specific User-Agent values and block everything else (e.g. to discourage scraping). We could set our User-Agent header so that we pretend to be Wget. We could even do that as a third fallback option, used only if all else fails (currently the first attempt uses urllib, the second uses requests).
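A minimal sketch of what such a fallback could look like, using only the standard library. The function name, the `opener` parameter, and the list of User-Agent values are hypothetical illustrations, not EasyBuild's actual API:

```python
import urllib.error
import urllib.request

# Hypothetical sketch (not EasyBuild's actual code): try the default
# 'EasyBuild' User-Agent first, then retry with a Wget-like User-Agent
# if the server rejects the request (e.g. with HTTP 403).
USER_AGENTS = ['EasyBuild', 'Wget/1.21.1']

def fetch_with_ua_fallback(url, opener=urllib.request.urlopen, timeout=10):
    """Try each User-Agent in turn; return the response body on the first success."""
    last_error = None
    for user_agent in USER_AGENTS:
        request = urllib.request.Request(
            url, headers={'User-Agent': user_agent, 'Accept': '*/*'}
        )
        try:
            with opener(request, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # e.g. 403 Forbidden: remember the error and try the next User-Agent
            last_error = err
    raise RuntimeError('Download of %s failed: %s' % (url, last_error))
```

The `opener` parameter is only there to make the sketch testable without network access; in EasyBuild itself this logic would presumably slot into the existing retry loop in filetools.py.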
