
Fix SQLite corruption on bucket-mounted Spaces #501

Merged: abidlabs merged 9 commits into main from single-writer-sqlite on Apr 16, 2026
@abidlabs (Member) commented Apr 15, 2026

  • When Trackio runs on an HF Space with a bucket mount (hf-mount), the FUSE filesystem doesn't support file locking. SQLite depends on fcntl/flock for internal consistency, so even a single process can see corruption when locks are silently no-ops.
  • Detects bucket-mount environment (TRACKIO_BUCKET_ID + SYSTEM=spaces) and switches to a single persistent connection with PRAGMA locking_mode=EXCLUSIVE. SQLite grabs the lock once and never releases it — with only one connection open, there's no contention and no reliance on filesystem locking.
  • Also switches ProcessLock to use in-memory threading.Lock instead of file-based locking in the same environment.
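
As a rough illustration of the approach described in the list above, the following is a minimal sketch of a single long-lived connection opened with `PRAGMA locking_mode=EXCLUSIVE` and guarded by an in-memory lock. The names `get_exclusive_connection` and `execute_write` are hypothetical and are not taken from Trackio's code; the sketch also assumes a single database per process.

```python
import sqlite3
import threading

# Hypothetical module-level state; these names are illustrative, not Trackio's.
_conn_lock = threading.Lock()
_conn = None


def get_exclusive_connection(db_path):
    """Return one long-lived connection that is set to exclusive locking mode."""
    global _conn
    with _conn_lock:
        if _conn is None:
            # check_same_thread=False lets several threads share the connection;
            # callers must serialize their use of it (see execute_write).
            conn = sqlite3.connect(db_path, check_same_thread=False)
            # In exclusive mode SQLite keeps the file lock once it is acquired
            # instead of releasing it after each transaction.
            conn.execute("PRAGMA locking_mode=EXCLUSIVE")
            _conn = conn
        # Assumption: one database per process, so db_path is only used once.
        return _conn


def execute_write(db_path, sql, params=()):
    """Run one write statement while holding the in-memory lock."""
    conn = get_exclusive_connection(db_path)
    with _conn_lock:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
```

Because every caller goes through the same connection and the same `threading.Lock`, there is no second connection that could contend for the file lock.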

This is a much simpler alternative to the Parquet storage backend approach (~110 lines changed vs ~3500), which was proposed by Claude Code and reviewed by me to solve the same root cause.

cc @Wauplin @XciD for visibility

…ket-mounted Spaces

When Trackio runs on an HF Space with a bucket mount (hf-mount), the FUSE
filesystem doesn't support file locking. SQLite depends on fcntl/flock for
internal consistency, so even a single process can see corruption when locks
are silently no-ops.

Fix: detect bucket-mount environment (TRACKIO_BUCKET_ID + SYSTEM=spaces) and
switch to a single persistent connection with PRAGMA locking_mode=EXCLUSIVE.
This tells SQLite to grab the lock once and never release it -- with only one
connection open, there's no contention and no reliance on filesystem locking.

Also switches ProcessLock to use in-memory threading.Lock instead of file-based
locking in the same environment, since file locks on the mount are unreliable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gradio-pr-bot (Contributor) commented Apr 15, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

@gradio-pr-bot (Contributor) commented Apr 15, 2026

🦄 change detected

This Pull Request includes changes to the following packages.

| Package | Version |
| --- | --- |
| trackio | patch |

  • Fix SQLite corruption on bucket-mounted Spaces


@HuggingFaceDocBuilderDev commented Apr 15, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/656398d01f18dbaf110edff38a4cd446992416ff/trackio-0.22.0-py3-none-any.whl"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses SQLite corruption seen on Hugging Face Spaces when Trackio runs on a bucket-mounted (FUSE) filesystem where file locking is unreliable, by switching to an exclusive-locking + single-connection strategy in that environment.

Changes:

  • Add bucket-mount detection (TRACKIO_BUCKET_ID + SYSTEM=spaces) and enable PRAGMA locking_mode=EXCLUSIVE.
  • Introduce per-DB persistent SQLite connections and serialize access via in-memory locks when exclusive locking is enabled.
  • Update ProcessLock to use an in-memory threading.Lock instead of file locks in the same environment, and add a unit test covering the exclusive mode behavior.
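
To make the last point concrete, here is a minimal sketch of what an in-memory replacement for a file-based lock could look like. `InMemoryProcessLock` is a hypothetical name for illustration and is not Trackio's `ProcessLock`; as noted in the PR, a `threading.Lock` only protects threads inside one process.

```python
import threading


class InMemoryProcessLock:
    """Hypothetical stand-in for a file-based lock; valid within one process only."""

    # Shared registry so that two instances created for the same key
    # (for example, the same database path) use the same underlying lock.
    _locks = {}
    _registry_guard = threading.Lock()

    def __init__(self, key):
        with self._registry_guard:
            self._lock = self._locks.setdefault(key, threading.Lock())

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False  # do not swallow exceptions
```

Usage would mirror a file lock, for example `with InMemoryProcessLock(db_path): ...`, with the caveat that a second process would not see the lock at all.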

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| trackio/sqlite_storage.py | Adds exclusive-locking mode detection, persistent connections, and switches locking strategy in bucket-mounted Spaces. |
| tests/unit/test_sqlite_storage.py | Adds a unit test validating exclusive locking mode and persistent connection creation. |
| .changeset/rare-olives-find.md | Adds a changeset entry documenting the feature/fix. |


abidlabs and others added 2 commits April 15, 2026 19:02
…p, fix docstring

- Narrow `except Exception` to `except sqlite3.Error` in persistent connection
  health check and cleanup
- Add `_close_all_persistent_connections()` with `atexit` hook to prevent
  leaking file descriptors in long-running processes
- Clarify ProcessLock docstring: in-memory threading.Lock is single-process only
- Add comment explaining why `configure_pragmas` is intentionally ignored in
  the exclusive-locking path
- Remove `test_exclusive_locking_not_enabled_by_default` test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covered by the e2e Spaces tests which exercise the real bucket-mount
environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gradio-pr-bot and others added 3 commits April 16, 2026 02:05
TRACKIO_BUCKET_ID is not set as a Space variable, so checking for it
would miss the bucket-mount case. Simplify to just SYSTEM=spaces — a
single persistent connection is the right default for any Space
environment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
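
The simplified detection described in this commit amounts to a single environment check. A minimal sketch follows, assuming only what the commit message states (the `SYSTEM=spaces` variable); the function name `should_use_persistent_connection` is hypothetical.

```python
import os


def should_use_persistent_connection(environ=None):
    """Return True when running on a Space, judged only by SYSTEM=spaces."""
    env = os.environ if environ is None else environ
    # TRACKIO_BUCKET_ID is deliberately not consulted, per the commit message.
    return env.get("SYSTEM") == "spaces"
```

Passing the environment as a parameter is a design choice for this sketch only; it makes the check easy to exercise in a unit test without modifying `os.environ`.
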
Comment thread tests/unit/test_utils.py
Member, Author

Unrelated change: removes PERSISTANT_STORAGE_ENABLED, as we no longer have persistent storage on Spaces.

@abidlabs (Member, Author)

This is relatively small, so I'll go ahead and merge it so that we can test it in live Spaces.

@abidlabs abidlabs merged commit 06ea885 into main Apr 16, 2026
9 checks passed

@Wauplin (Contributor) left a comment

Nice workaround!

@abidlabs (Member, Author)

I investigated this a bit with @XciD and found that it probably wasn't a locking issue. Previously (before this PR), we were trying to create a new SQLite connection on every request, and some of them got dropped because the SQLite lock took a long time to resolve. Now we have a single persistent connection that all requests use (each request just adds to an in-memory queue). I was not able to reproduce the stronger bucket-specific claim that HF bucket mounts silently ignore locks; in fact, in my tests the mounted bucket behaved the same as a local /tmp file.
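
One possible reading of "each request just adds to an in-memory queue" is a worker thread that owns the only connection and drains a queue of statements. The sketch below illustrates that pattern under this assumption; `SingleWriter` and its methods are hypothetical and may differ from how Trackio actually serializes requests.

```python
import queue
import sqlite3
import threading


class SingleWriter:
    """Hypothetical worker that owns the only SQLite connection."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # The connection is created and used on this thread only.
        conn = sqlite3.connect(self._db_path)
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: stop the worker
                break
            sql, params, done = job
            try:
                with conn:  # commit on success, roll back on error
                    conn.execute(sql, params)
            except sqlite3.Error as exc:
                done["error"] = exc
            finally:
                done["event"].set()
        conn.close()

    def submit(self, sql, params=()):
        """Enqueue one statement and wait until the worker has run it."""
        done = {"event": threading.Event(), "error": None}
        self._jobs.put((sql, params, done))
        done["event"].wait()
        if done["error"] is not None:
            raise done["error"]

    def close(self):
        """Stop the worker after it finishes the jobs already queued."""
        self._jobs.put(None)
        self._thread.join()
```

Request handlers would call `submit(...)` from any thread; since only the worker touches the database, no request has to open its own connection or wait on a SQLite lock.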


5 participants