[Bug] SingleFile uploads incorrectly trigger HTTP GET on original URL, causing missing titles and crawling failures #2579
Description
Describe the Bug
When uploading a packaged HTML file via the /api/v1/bookmarks/singlefile endpoint (e.g., using the SingleFile extension), the backend still attempts to make an HTTP GET request to the original URL to determine its content type.
If the original website blocks crawlers, requires authentication, or is otherwise inaccessible, this pre-check fails. As a result, the crawler's parsing logic breaks, causing the bookmark to lose its title and other metadata, even though a perfectly valid HTML file was successfully uploaded.
Steps to Reproduce
Steps to reproduce the behavior:
1. Configure the SingleFile extension to upload to the /api/v1/bookmarks/singlefile endpoint as per the official documentation.
2. Visit a webpage that actively blocks crawlers or requires being logged in.
3. Use the SingleFile extension to package and upload the page to Karakeep.
4. Check the newly created bookmark in Karakeep.
Expected Behaviour
The system should recognize that a precrawledArchiveAssetId (the uploaded HTML file) was provided and entirely skip making any network requests to the original URL. The uploaded HTML file should be parsed directly to extract the correct title, description, and content.
Screenshots or Additional Context
I looked into the source code and found the root cause. This issue was introduced in commit 2743d9e (#1163), which added support for saving direct image/PDF bookmarks.
In apps/workers/workers/crawlerWorker.ts, inside the runCrawler function, the crawler unconditionally executes getContentType(url):
// crawlerWorker.ts - runCrawler function
const contentType = await getContentType(url, jobId, job.abortSignal);
This HTTP GET request fires before the system checks for precrawledArchiveAssetId inside crawlAndParseUrl.
When getContentType hits an anti-bot restriction, the failure affects the rest of the crawling pipeline, even though crawlAndParseUrl itself correctly tries to read the precrawledArchiveAssetId.
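One possible shape for a fix is to guard the content-type probe behind a check for the uploaded archive. The sketch below is not the actual Karakeep code: CrawlRequest, ContentTypeProbe, needsContentTypeProbe, and resolveContentType are hypothetical names used for illustration, and only url and precrawledArchiveAssetId come from the real source. It also assumes that a SingleFile upload can always be treated as text/html.

```typescript
// Hypothetical request shape; only `url` and `precrawledArchiveAssetId`
// are names taken from the real code.
interface CrawlRequest {
  url: string;
  precrawledArchiveAssetId?: string | null;
}

// Hypothetical stand-in for getContentType(url, jobId, abortSignal).
type ContentTypeProbe = (url: string) => Promise<string | null>;

// A network probe is only needed when no pre-crawled archive was uploaded.
function needsContentTypeProbe(request: CrawlRequest): boolean {
  return !request.precrawledArchiveAssetId;
}

// Resolve the content type without contacting the original URL
// when an uploaded archive is available.
async function resolveContentType(
  request: CrawlRequest,
  probe: ContentTypeProbe,
): Promise<string | null> {
  if (!needsContentTypeProbe(request)) {
    // Assumption: a SingleFile upload is always an HTML document.
    return "text/html";
  }
  return probe(request.url);
}
```

With a guard like this in runCrawler, the existing image/PDF detection would still run for ordinary bookmarks, while SingleFile uploads would go straight to parsing the uploaded file.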
Device Details
No response
Exact Karakeep Version
v0.31.0
Environment Details
Docker
Debug Logs
No response
Have you checked the troubleshooting guide?
- I have checked the troubleshooting guide and I haven't found a solution to my problem