[Bug] SingleFile uploads incorrectly trigger HTTP GET on original URL, causing missing titles and crawling failures #2579

@GSAlex

Describe the Bug

When uploading a packaged HTML file via the /api/v1/bookmarks/singlefile endpoint (e.g., using the SingleFile extension), the backend still attempts to make an HTTP GET request to the original URL to determine its content type.

If the original website blocks crawlers, requires authentication, or is otherwise inaccessible, this pre-check fails. As a result, the crawler's parsing logic breaks, causing the bookmark to lose its title and other metadata, even though a perfectly valid HTML file was successfully uploaded.

Steps to Reproduce

Steps to reproduce the behavior:

1. Configure the SingleFile extension to upload to the /api/v1/bookmarks/singlefile endpoint as per the official documentation.
2. Visit a webpage that actively blocks crawlers or requires being logged in.
3. Use the SingleFile extension to package and upload the page to Karakeep.
4. Check the newly created bookmark in Karakeep.

Expected Behaviour

The system should recognize that a precrawledArchiveAssetId (the uploaded HTML file) was provided and entirely skip making any network requests to the original URL. The uploaded HTML file should be parsed directly to extract the correct title, description, and content.

Screenshots or Additional Context

I looked into the source code and found the root cause. This issue was introduced in commit 2743d9e (#1163), which added support for saving direct image/PDF bookmarks.

In apps/workers/workers/crawlerWorker.ts, inside the runCrawler function, the crawler unconditionally executes getContentType(url):

// crawlerWorker.ts - runCrawler function
const contentType = await getContentType(url, jobId, job.abortSignal);

This HTTP GET request fires before the system checks for precrawledArchiveAssetId inside crawlAndParseUrl.

When getContentType hits an anti-bot restriction, the failure disrupts the rest of the crawling pipeline, even though crawlAndParseUrl itself correctly tries to read the precrawledArchiveAssetId.
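
One possible shape for the fix, as a minimal sketch rather than a patch against the real code: decide whether to probe the URL before calling anything that touches the network. The names CrawlRequest, ContentTypeProbe, shouldProbeContentType and resolveContentType are hypothetical and only illustrate the ordering. The real getContentType takes (url, jobId, job.abortSignal), and the sketch assumes an uploaded SingleFile archive can always be treated as text/html.

```typescript
// Hypothetical, simplified request shape; the real types in crawlerWorker.ts may differ.
interface CrawlRequest {
  url: string;
  // Set when an HTML archive was uploaded (e.g. via the singlefile endpoint).
  precrawledArchiveAssetId?: string;
}

// Stand-in for the network call that determines the content type of the original URL.
type ContentTypeProbe = (url: string) => Promise<string | null>;

// True only when there is no uploaded archive that could be parsed instead.
function shouldProbeContentType(req: CrawlRequest): boolean {
  return !req.precrawledArchiveAssetId;
}

// Resolves the content type without any network request for precrawled uploads.
async function resolveContentType(
  req: CrawlRequest,
  probe: ContentTypeProbe,
): Promise<string | null> {
  if (!shouldProbeContentType(req)) {
    // Assumption: a SingleFile upload is always an HTML document.
    return "text/html";
  }
  return probe(req.url);
}
```

With a guard like this, a blocked or login-only site could no longer affect a SingleFile upload, because the probe is never invoked when precrawledArchiveAssetId is present. Direct image/PDF bookmarks would still go through the probe as before.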

Device Details

No response

Exact Karakeep Version

v0.31.0

Environment Details

Docker

Debug Logs

No response

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

Labels: bug, status/untriaged
