Commit 3b9b28b
fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output
`gc.scan_hash_references` and `gc.scan_schema_references` were calling
`table.to_arrays(attr_name)`, which routes through `Expression.to_dicts`
(`expression.py:899`) and runs `decode_attribute` (`codecs.py:518`) for
each value. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`,
`<npy@>`, `<object@>`) this means downloading the bytes from external
storage and deserializing them into the codec's runtime type — a
`numpy.ndarray`, an `NpyRef`, an `ObjectRef`, raw bytes, or a local
file-path string — none of which satisfy `_extract_*_refs`'s
`isinstance(value, dict) and "path" in value` check.
Both helpers therefore silently returned empty reference sets. Every
populated schema reported `hash_referenced: 0` and
`schema_paths_referenced: 0`, so every external file looked orphaned to
`scan()` and a subsequent `collect()` would have deleted live data. The
broad `try/except Exception` around the loop never fired because no
exception was raised — `_extract_*_refs` returns `[]` for unrecognized
shapes by design.
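
The silent failure can be reproduced in isolation. The sketch below is not the project's code: `extract_refs` is a hypothetical stand-in that mirrors only the `isinstance(value, dict) and "path" in value` check quoted above, `FakeObjectRef` is a placeholder for a decoded runtime type, and the path string is made up for illustration.

```python
def extract_refs(value):
    """Hypothetical stand-in: return the stored path if value looks like metadata."""
    if isinstance(value, dict) and "path" in value:
        return [value["path"]]
    return []  # unrecognized shape: no exception, just no references


class FakeObjectRef:
    """Placeholder for a decoded runtime type such as ObjectRef."""

    def __init__(self, path):
        self.path = path


# What the database column holds (raw metadata) ...
raw_metadata = {"path": "example/ab/cd/abcdef", "size": 1024}
# ... versus what a decode step might hand back (illustrative values).
decoded_values = [b"\x00\x01", "/tmp/download/file.bin", FakeObjectRef("x")]

refs_from_raw = extract_refs(raw_metadata)
refs_from_decoded = [ref for v in decoded_values for ref in extract_refs(v)]
```

Under these assumptions the raw metadata yields one reference and the decoded values yield none, and no exception is raised along the way, so a surrounding `try/except` has nothing to catch.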
The intended design (per `reference/specs/type-system.md`) is for GC to
operate on the raw stored JSON metadata, not the decoded payload.
Replace `table.to_arrays(attr_name)` with
`table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor
yields the JSON column value directly: a Python dict on PostgreSQL
(JSONB auto-decoded) or a JSON string on MySQL. `_extract_*_refs`
already handles both shapes (`gc.py:138` string branch, `gc.py:145` dict
branch), so this is backend-agnostic with no adapter dispatch.
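
As a rough illustration of why no adapter dispatch is needed, the sketch below normalizes both shapes before applying the `"path"` check. The names `extract_refs` and `scan_rows` and the sample rows are hypothetical; the real helpers in `gc.py` and the real cursor rows may look different.

```python
import json


def extract_refs(value):
    """Hypothetical sketch: accept a JSON string (MySQL) or a dict (PostgreSQL)."""
    if isinstance(value, str):
        try:
            value = json.loads(value)  # string branch: parse the stored JSON
        except json.JSONDecodeError:
            return []
    if isinstance(value, dict) and "path" in value:  # dict branch
        return [value["path"]]
    return []


def scan_rows(rows, attr_name):
    """Collect referenced paths from dict-style rows, as a dict cursor would yield."""
    refs = set()
    for row in rows:
        refs.update(extract_refs(row.get(attr_name)))
    return refs


# Illustrative rows: the same metadata as each backend might return it.
postgres_rows = [{"data": {"path": "a/b/c"}}]
mysql_rows = [{"data": '{"path": "a/b/c"}'}]
```

Both row sets produce the same reference set here, and a `NULL` column (a `None` value) contributes nothing.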
Side effect: `scan` is now a metadata-only operation. Previously it
downloaded and deserialized every external blob, then discarded the
result through the silent type mismatch; on a 1 TB schema that meant
1 TB of egress to produce `referenced: 0`. After this change, scan reads
only the JSON column from the database.
Custom-codec authors are unaffected: reference discovery operates on
the raw stored metadata regardless of what the codec's `decode()`
returns, so third-party codecs following the documented
`encode`/`decode` contract get correct GC for free.
Tests
-----
The existing `tests/integration/test_gc.py` mocks `scan_hash_references`,
`scan_schema_references`, and `list_*_paths` directly, so the production
code path through `to_arrays` → `decode_attribute` was never exercised
end-to-end. The mocked tests stay (they cover orchestration: composition
with `list_*_paths`, dry-run vs real, stat-key shape, format strings).
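
A minimal sketch of why mocking at this level hides the bug, using `unittest.mock` against a hypothetical `fake_gc` namespace (not the real `datajoint.gc` module, and with simplified stat keys):

```python
import types
from unittest import mock

fake_gc = types.SimpleNamespace()


def scan_hash_references(rows):
    # Simulated bug: decoded values never match, so nothing is found.
    return set()


def scan(rows, stored_paths):
    # Orchestration: compare what is referenced with what is stored.
    referenced = fake_gc.scan_hash_references(rows)
    return {
        "hash_referenced": len(referenced),
        "hash_orphaned": len(stored_paths - referenced),
    }


fake_gc.scan_hash_references = scan_hash_references
fake_gc.scan = scan

rows = [{"path": "a/b/c"}]
stored = {"a/b/c"}

# Mocked test: the buggy function is replaced, so the result looks correct.
with mock.patch.object(fake_gc, "scan_hash_references", return_value={"a/b/c"}):
    mocked_stats = fake_gc.scan(rows, stored)

# Non-mocked run: the real (buggy) function executes and the bug is visible.
live_stats = fake_gc.scan(rows, stored)
```

The mocked run reports one reference and no orphans, while the non-mocked run reports zero references and one orphan, which is the pattern seen in the pre-fix output.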
Add a `TestScanWithLiveData` class with three non-mocked end-to-end
tests, one per structurally distinct decoded-value type:
- `test_scan_finds_active_blob_reference` — `<blob@>` (decode → ndarray)
- `test_scan_finds_active_npy_reference` — `<npy@>` (decode → NpyRef)
- `test_scan_finds_active_object_reference` — `<object@>` (decode → ObjectRef)
Each declares a small manual table, inserts one row, and asserts
`scan(schema, store_name='local')` reports the expected `*_referenced`
count > 0. Verified to fail on the pre-fix code:
`{'hash_referenced': 0, 'hash_stored': 1, 'hash_orphaned': 1, ...}`.
Adjacent
--------
Register `gc` in `_lazy_modules` (`src/datajoint/__init__.py`). The
`gc.py` module docstring and the user docs at
`how-to/garbage-collection.md` both invoke GC as `dj.gc.scan(...)`,
which previously raised `AttributeError` because `gc` wasn't lazily
exposed at the package level. Pattern matches the existing
`"diagram": (".diagram", None)` entry.
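
For readers unfamiliar with the pattern, here is a self-contained sketch of lazy attribute exposure via a module-level `__getattr__` (PEP 562). It assumes the registry maps a public name to a `(module path, attribute or None)` tuple, as the `"diagram"` entry suggests; the `pkg` module, the registry entries, and the use of the standard-library `json` module as the target are all illustrative, and the real `__init__.py` may resolve relative paths differently.

```python
import importlib
import types

# Hypothetical registry: public name -> (module path, attribute name or None).
_lazy_modules = {
    "json_tools": ("json", None),  # expose a whole module
    "dumps": ("json", "dumps"),    # expose a single attribute of a module
}

pkg = types.ModuleType("pkg")  # stands in for the package's __init__ module


def _lazy_getattr(name):
    """Called only when normal attribute lookup on the module fails."""
    if name not in _lazy_modules:
        raise AttributeError(f"module 'pkg' has no attribute {name!r}")
    module_path, attr = _lazy_modules[name]
    module = importlib.import_module(module_path)  # import on first access
    value = module if attr is None else getattr(module, attr)
    setattr(pkg, name, value)  # cache so later lookups skip __getattr__
    return value


pkg.__getattr__ = _lazy_getattr
```

With this in place, `pkg.json_tools` resolves on first access, whereas a name missing from the registry raises `AttributeError`, which is the symptom described above for `dj.gc` before it was registered.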
Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU
window between scan and delete during which a concurrent transaction
could insert a row referencing what looks like an orphan. A two-phase
retrieval/removal API (quarantine → grace window → purge) is the right
remedy and will be tracked as a separate enhancement issue.
Fixes #1442
1 parent db42c26 commit 3b9b28b
3 files changed
Lines changed: 135 additions & 8 deletions