feat: add StorageAdapter plugin system for third-party storage protocols#1432
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a store config uses an unrecognised protocol, settings.py now queries the adapter registry before raising an error. Registered plugin adapters are validated and default keys are applied; unknown protocols surface a clear message directing users to install a plugin package. Includes new test class TestGetStoreSpecPluginDelegation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The three else-branches in StorageBackend (_create_filesystem, _full_path, get_url) now query the adapter registry before falling back to the built-in behaviour. Registered plugin adapters are called for filesystem construction, path composition, and URL generation; unknown protocols still raise DataJointError. Includes new test class TestStorageBackendPluginDelegation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move imports to top of file and apply ruff-format to satisfy CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
59c47de to
cd93c19
Compare
Review — two simpler alternatives to considerGood work, Kushal. The need is real: third-party packages should be able to register new storage backends without monkey-patching. Before we commit to the ABC approach, I want to explore whether we can achieve the same result with less machinery — or with machinery that also improves the existing code. Background: what StorageBackend actually does on top of fsspecLooking at `storage.py`, there are five protocol-specific concerns:
~80% of I/O already delegates to fsspec directly. The protocol-specific code is concentrated in items 1-4, spread across ~26 `if self.protocol == "..."` branches. fsspec already has its own protocol discovery — `fsspec.filesystem("s3")` auto-discovers S3FileSystem. So the question: what does DataJoint need to add on top of fsspec's own plugin system? Alternative A — Generic fsspec passthroughThe lightest option: make `StorageBackend` accept any fsspec-supported protocol by default. No new ABC, no entry points, no registry. For unknown protocols:
~30 lines of changes. No new module. Works for any fsspec filesystem immediately. Downside: no config validation or custom path/URL logic for unknown protocols. Alternative B — Unified adapter model for ALL protocols (preferred)Route all backends through adapters — including the existing four. The 26 hardcoded `if self.protocol == "..."` branches become adapter implementations: ```python ``` Built-in adapters (File, S3, GCS, Azure) registered at import time. Third-party adapters discovered via entry points. `StorageBackend` becomes a thin dispatcher — one code path for all protocols. This eliminates the 26 hardcoded protocol branches, makes third-party backends first-class citizens alongside built-ins, and gives us the right abstraction. The user-facing API (`dj.config.stores`) doesn't change at all — this is purely internal refactoring. We test against the existing integration test suite and release within the 2.2.x series when confident. AssessmentYour current PR has the right instincts — ABC, entry points, lazy discovery. But it creates a two-tier system: hardcoded paths for built-in protocols, adapter path for everything else. That means maintaining both code paths going forward. I'd prefer Alternative B: if we're introducing the adapter abstraction, route everything through it. The existing integration tests validate that the built-in adapters behave correctly. Since the user API doesn't change, this is safe to land in 2.2.x with thorough testing. @kushalbakshi — would you be up for reworking this PR to route the four existing protocols through adapters as well? The `StorageAdapter` ABC and entry-point discovery from your PR are the right foundation; the additional work is extracting the existing protocol-specific logic from `StorageBackend` into adapter subclasses. |
dimitri-yatsenko
left a comment
There was a problem hiding this comment.
See my comment above — recommending a unified adapter model that routes all four built-in protocols through the same adapter system, not just new ones.
|
Thanks for the detailed review, Dimitri. I agree the two-tier pattern is a real architectural smell and that routing all backends through a unified interface is the right end state. Happy to own that migration — but I want to push back on bundling it into this PR, for three reasons. 1. Alt A (pure fsspec passthrough) isn't a substitute for the adapter layer. fsspec's entry-point discovery ( 2. The
These branches aren't incidental. 3. Unit test coverage on the built-in protocols is thin. The only dedicated unit test touching Proposal. Land this PR as-is — it's strictly additive ( I'm happy to own that workstream. Given the scope, I'd want to sequence it against other commitments rather than execute it immediately inside this change — doing it now at the cost of higher-priority items isn't the right trade-off, and bundling it here would force the design decisions under the wrong time pressure. If you're good with the staging, I'll open the follow-up issue today. |
|
Thanks for the detailed pushback, Kushal. I re-read the code and your three claims hold up where it matters most: the atomic-write contract in So: I agree with your proposal to land additive + follow-up. I've filed #1440 to track the unified-adapter migration, with Phase 0 (per-protocol unit-test scaffolding) as a hard prerequisite, the atomicity and recursive-op semantics called out as gating design questions, and per-protocol phasing ( Three things I'd like addressed on this PR before merge: 1. Asymmetric error handling in
In practice 2. Default duplication in 3. Entry-point discovery itself is untested. All 19 tests mutate A nit: please drop the 🤖 trailer from the PR description before merge. Once those land I'll approve. The two-tier pattern is real but it makes an already-implicit asymmetry explicit, and #1440 is the right place to resolve it properly. |
Three changes from Dimitri's pre-merge review: 1. Symmetric error handling in storage.py Add `_require_adapter` helper on `StorageBackend`. `_create_filesystem`, `_full_path`, and `get_url` now all raise the same `DataJointError` when an unknown protocol has no registered adapter, regardless of call order. Previously `_full_path` silently fell through to file-protocol logic and `get_url` silently emitted `protocol://path` for unknown protocols, which would silently produce wrong paths if a future refactor called either before filesystem creation. 2. Default consolidation in settings.py Extract `_apply_common_store_defaults` static helper. Removes the 7-line redundant setdefault block from the plugin branch. Drops the `location=""` default for plugins so adapters can list `location` in `required_keys` and surface a clear missing-key error instead of silently running with an empty path prefix. 3. Cover entry-point discovery in tests Add `TestEntryPointDiscovery` driving `_discover_adapters()` directly via `monkeypatch.setattr` on `importlib.metadata.entry_points`, plus a graceful-failure test that exercises the `except Exception` branch. Also extend the unsupported-protocol test to cover `_full_path` and `get_url` symmetry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the additional pass, Dimitri. Pushed 1. Symmetric error handling — added 2. Default consolidation — extracted 3. Discovery test — added
I also extended the symmetric-error coverage — Plus the nit: dropped the trailer from the PR description. Test counts: 23 unit tests in Ready for re-review. |
There was a problem hiding this comment.
Approving — ae0dfb5 closes out the three blockers cleanly.
1. Symmetric error handling. The _require_adapter helper in storage.py unifies the missing-adapter error across _create_filesystem, _full_path, and get_url. Bonus: _full_path now has an explicit elif self.protocol == "file": instead of the previous catch-all else: for file logic, so unknown protocols can no longer silently take the file path. Both new symmetry tests (test_unsupported_protocol_full_path_raises, test_unsupported_protocol_get_url_raises) lock that in.
2. Default consolidation. _apply_common_store_defaults is extracted as requested and called once at the top of get_store_spec. The deliberate decision to drop the location="" default from the plugin path so adapters can list it in required_keys is a nice judgment call — it lets plugins surface a clear missing-key error rather than silently running with an empty prefix.
3. Entry-point discovery covered. TestEntryPointDiscovery drives _discover_adapters() directly via monkeypatch.setattr on importlib.metadata.entry_points. The graceful-failure test (one EP raising RuntimeError, one succeeding) is well-shaped — it validates that a single bad plugin doesn't poison the registry and that the failure lands in the warning log with both name and exception message. Setup/teardown properly snapshot module-level state.
A minor gap I noticed: the fake entry_points only models the Python 3.10+ keyword API; the except TypeError fallback for Python <3.10 isn't exercised. Not blocking — flagging as a small follow-up if Python 3.9 support sticks around.
One cosmetic item to strip at squash-merge time: the Co-Authored-By: Claude Opus 4.7 (1M context) trailer in the ae0dfb5 commit message — project convention on master is to omit these.
The architectural sequencing (land additive here, track full unification in #1440 with Phase 0 test scaffolding as a hard prerequisite) was the right call. The atomicity contract in safe_write/safe_copy and the recursive-op semantics deserve a dedicated design pass, not a rushed bundle. LGTM.
Summary
Adds a
StorageAdapterentry-point plugin system that allows third-party packages to register new storage protocols without modifying DataJoint internals.StorageAdapterABC insrc/datajoint/storage_adapter.pywith 4 extension points:create_filesystem,validate_spec,full_path,get_urldatajoint.storagegroup (mirrors the existing codec plugin pattern incodecs.py)settings.py(get_store_spec) andstorage.py(_create_filesystem,_full_path,get_url)file,s3,gcs,azure)Motivation
DataJoint's external storage layer currently supports four hardcoded protocols. Extending it with custom fsspec-backed storage requires monkey-patching private methods — fragile, non-composable, and breaks on upstream refactors. This PR adds a clean plugin system so third-party packages can register storage protocols via standard Python entry points.
How third-party packages use it
No import needed — DataJoint auto-discovers the adapter when it encounters the protocol in
dj.config.stores.Test plan