Commit 3b9b28b
fix(#1442): scan_*_references reads raw JSON metadata instead of decoded codec output
`gc.scan_hash_references` and `gc.scan_schema_references` were calling `table.to_arrays(attr_name)`, which routes through `Expression.to_dicts` (`expression.py:899`) and runs `decode_attribute` (`codecs.py:518`) for each value. For the storage codecs (`<blob@>`, `<hash@>`, `<attach@>`, `<npy@>`, `<object@>`) this means downloading the bytes from external storage and deserializing them into the codec's runtime type — a `numpy.ndarray`, an `NpyRef`, an `ObjectRef`, raw bytes, or a local file-path string — none of which satisfy `_extract_*_refs`'s `isinstance(value, dict) and "path" in value` check.

Both helpers therefore silently returned empty reference sets. Every populated schema reported `hash_referenced: 0` and `schema_paths_referenced: 0`, so every external file looked orphaned to `scan()` and a subsequent `collect()` would have deleted live data. The broad `try/except Exception` around the loop never fired because no exception was raised — `_extract_*_refs` returns `[]` for unrecognized shapes by design.

The intended design (per `reference/specs/type-system.md`) is for GC to operate on the raw stored JSON metadata, not the decoded payload. Replace `table.to_arrays(attr_name)` with `table.proj(attr_name).cursor(as_dict=True)` in both helpers. The cursor yields the JSON column value directly: a Python dict on PostgreSQL (JSONB auto-decoded) or a JSON string on MySQL. `_extract_*_refs` already handles both shapes (`gc.py:138` string branch, `gc.py:145` dict branch), so this is backend-agnostic with no adapter dispatch.

Side effect: `scan` is now a metadata-only operation. Previously it downloaded every external blob just to deserialize and discard the result via the silent type mismatch; on a 1 TB schema that meant 1 TB of egress to produce `referenced: 0`. After this change, scan touches only the JSON column on the database.

Custom-codec authors are unaffected: reference discovery operates on the raw stored metadata regardless of what the codec's `decode()` returns, so third-party codecs following the documented `encode`/`decode` contract get correct GC for free.

Tests
-----
The existing `tests/integration/test_gc.py` mocks `scan_hash_references`, `scan_schema_references`, and `list_*_paths` directly, so the production code path through `to_arrays` → `decode_attribute` was never exercised end-to-end. The mocked tests stay (they cover orchestration: composition with `list_*_paths`, dry-run vs real, stat-key shape, format strings).

Add a `TestScanWithLiveData` class with three non-mocked end-to-end tests, one per structurally distinct decoded-value type:

- `test_scan_finds_active_blob_reference` — `<blob@>` (decode → ndarray)
- `test_scan_finds_active_npy_reference` — `<npy@>` (decode → NpyRef)
- `test_scan_finds_active_object_reference` — `<object@>` (decode → ObjectRef)

Each declares a small manual table, inserts one row, and asserts `scan(schema, store_name='local')` reports the expected `*_referenced` count > 0. Verified to fail on the pre-fix code: `{'hash_referenced': 0, 'hash_stored': 1, 'hash_orphaned': 1, ...}`.

Adjacent
--------
Register `gc` in `_lazy_modules` (`src/datajoint/__init__.py`). The `gc.py` module docstring and the user docs at `how-to/garbage-collection.md` both invoke GC as `dj.gc.scan(...)`, which previously raised `AttributeError` because `gc` wasn't lazily exposed at the package level. Pattern matches the existing `"diagram": (".diagram", None)` entry.

Out of scope
------------
GC remains non-transaction-safe even after this fix — there is a TOCTOU window between scan and delete during which a concurrent transaction could insert a row referencing what looks like an orphan. A two-phase retrieval/removal API (quarantine → grace window → purge) is the right remedy and will be tracked as a separate enhancement issue.

Fixes #1442
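The failure mode and the fix condense into a few lines. The following sketch is illustrative only: `decode` stands in for the codecs' `decode_attribute` (which in the real system downloads bytes from external storage), `extract_refs` stands in for the `_extract_*_refs` helpers, and the metadata layout is assumed from the description above, not taken from the actual datajoint internals.

```python
import json

downloads = {"count": 0}

# One row's raw stored JSON metadata for an externally stored value.
stored_metadata = [{"path": "blobs/ab/cd.bin", "store": "local"}]

def decode(meta):
    """Stand-in for decode_attribute: one call models one round trip of egress."""
    downloads["count"] += 1
    return b"\x00" * 64  # decoded runtime value (ndarray, bytes, NpyRef, ...)

def extract_refs(value):
    """Stand-in for _extract_*_refs: dict (JSONB) or JSON string (MySQL)."""
    if isinstance(value, str):
        value = json.loads(value)
    if isinstance(value, dict) and "path" in value:
        return [(value["path"], value.get("store"))]
    return []  # unrecognized shape: no refs, no exception, so the bug was silent

# Buggy path: decode first, then extract. The reference is missed AND paid for.
buggy = {p for m in stored_metadata for p, s in extract_refs(decode(m))}

# Fixed path: extract straight from the raw stored metadata. No download at all.
fixed = {p for m in stored_metadata for p, s in extract_refs(m)}
```

Running this, `buggy` comes back empty after one simulated download, while `fixed` recovers the path without touching storage, mirroring the `referenced: 0` symptom and the metadata-only scan after the fix.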
1 parent db42c26 commit 3b9b28b

3 files changed

Lines changed: 135 additions & 8 deletions

src/datajoint/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -275,6 +275,9 @@ def FreeTable(conn_or_name, full_table_name: str | None = None) -> _FreeTable:
     "diagram": (".diagram", None),  # Return the module itself
     # cli imports click
     "cli": (".cli", "cli"),
+    # gc — exposed lazily so `dj.gc.scan(...)` works as documented in gc.py
+    # and in the user docs (how-to/garbage-collection.md).
+    "gc": (".gc", None),  # Return the module itself
 }
src/datajoint/gc.py

Lines changed: 14 additions & 8 deletions
@@ -229,11 +229,14 @@ def scan_hash_references(
     if verbose:
         logger.info(f"  Scanning {table_name}.{attr_name}")
 
-    # Fetch all values for this attribute
+    # Read raw JSON metadata via cursor — bypasses decode_attribute
+    # so we get the stored dict (PostgreSQL/JSONB) or JSON string
+    # (MySQL), not the decoded codec output. _extract_hash_refs
+    # handles both shapes.
     try:
-        values = table.to_arrays(attr_name)
-        for value in values:
-            for path, ref_store in _extract_hash_refs(value):
+        cursor = table.proj(attr_name).cursor(as_dict=True)
+        for row in cursor:
+            for path, ref_store in _extract_hash_refs(row[attr_name]):
                 # Filter by store if specified
                 if store_name is None or ref_store == store_name:
                     referenced.add(path)
@@ -291,11 +294,14 @@ def scan_schema_references(
     if verbose:
         logger.info(f"  Scanning {table_name}.{attr_name}")
 
-    # Fetch all values for this attribute
+    # Read raw JSON metadata via cursor — bypasses decode_attribute
+    # so we get the stored dict (PostgreSQL/JSONB) or JSON string
+    # (MySQL), not the decoded codec output. _extract_schema_refs
+    # handles both shapes.
     try:
-        values = table.to_arrays(attr_name)
-        for value in values:
-            for path, ref_store in _extract_schema_refs(value):
+        cursor = table.proj(attr_name).cursor(as_dict=True)
+        for row in cursor:
+            for path, ref_store in _extract_schema_refs(row[attr_name]):
                 # Filter by store if specified
                 if store_name is None or ref_store == store_name:
                     referenced.add(path)
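The dict-or-string contract the new comments rely on can be sketched as follows. The function name and the `{"path": ..., "store": ...}` key layout are illustrative, assumed from the diff comments rather than copied from the real `gc.py` extractor.

```python
import json

def extract_hash_refs(value):
    """Return (path, store) pairs from a raw stored JSON metadata value.

    PostgreSQL's JSONB column arrives as an already-decoded dict; MySQL's
    JSON column arrives as a string and must be parsed first. Anything else
    yields no references (and no exception), by design.
    """
    if isinstance(value, str):  # MySQL branch: JSON text
        try:
            value = json.loads(value)
        except ValueError:
            return []
    if isinstance(value, dict) and "path" in value:  # PostgreSQL branch: dict
        return [(value["path"], value.get("store"))]
    return []

pg_row = {"path": "a/b.npy", "store": "local"}       # JSONB, auto-decoded
mysql_row = '{"path": "a/b.npy", "store": "local"}'  # JSON delivered as text
```

Both shapes resolve to the same reference list, which is what makes the cursor-based scan backend-agnostic without any adapter dispatch.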
tests/integration/test_gc.py

Lines changed: 118 additions & 0 deletions
@@ -4,12 +4,43 @@
 
 from unittest.mock import MagicMock, patch
 
+import numpy as np
 import pytest
 
+import datajoint as dj
 from datajoint import gc
 from datajoint.errors import DataJointError
 
 
+# Tables used by TestScanWithLiveData. Defined at module scope so dj.Schema's
+# context resolution can find them by class name; bound to a schema inside
+# each fixture (see schema(...) calls below).
+
+
+class GcBlobTest(dj.Manual):
+    definition = """
+    rid : int
+    ---
+    payload : <blob@local>
+    """
+
+
+class GcNpyTest(dj.Manual):
+    definition = """
+    rid : int
+    ---
+    waveform : <npy@local>
+    """
+
+
+class GcObjectTest(dj.Manual):
+    definition = """
+    rid : int
+    ---
+    results : <object@local>
+    """
+
+
 class TestUsesHashStorage:
     """Tests for _uses_hash_storage helper function."""
 
@@ -347,3 +378,90 @@ def test_formats_collect_stats_actual(self):
         assert "Schema paths: 1" in result
         assert "2.00 MB" in result
         assert "Errors: 2" in result
+
+
+class TestScanWithLiveData:
+    """End-to-end tests for gc.scan() against real schemas with external storage.
+
+    Exercises the full production path:
+    scan_*_references → table.proj(attr).cursor() → raw JSON metadata.
+
+    These are the regression tests that would have caught issue #1442
+    (silent type mismatch when scan helpers iterated decoded codec outputs
+    instead of raw stored metadata).
+    """
+
+    @pytest.fixture
+    def schema_blob(self, connection_test, prefix, mock_stores):
+        schema_name = f"{prefix}_test_gc_e2e_blob"
+        schema = dj.Schema(
+            schema_name,
+            context={"GcBlobTest": GcBlobTest},
+            connection=connection_test,
+        )
+        schema(GcBlobTest)
+        yield schema
+        schema.drop()
+
+    @pytest.fixture
+    def schema_npy(self, connection_test, prefix, mock_stores):
+        schema_name = f"{prefix}_test_gc_e2e_npy"
+        schema = dj.Schema(
+            schema_name,
+            context={"GcNpyTest": GcNpyTest},
+            connection=connection_test,
+        )
+        schema(GcNpyTest)
+        yield schema
+        schema.drop()
+
+    @pytest.fixture
+    def schema_object(self, connection_test, prefix, mock_stores):
+        schema_name = f"{prefix}_test_gc_e2e_object"
+        schema = dj.Schema(
+            schema_name,
+            context={"GcObjectTest": GcObjectTest},
+            connection=connection_test,
+        )
+        schema(GcObjectTest)
+        yield schema
+        schema.drop()
+
+    def test_scan_finds_active_blob_reference(self, schema_blob):
+        """scan() must report hash_referenced >= 1 for a populated <blob@> column.
+
+        Decoded value type returned by BlobCodec.decode is numpy.ndarray, which
+        does not satisfy `_extract_hash_refs`'s dict/JSON-string check — this
+        test fails before the cursor-based fix in scan_hash_references.
+        """
+        GcBlobTest.insert1({"rid": 1, "payload": np.arange(64, dtype="uint8")})
+
+        stats = gc.scan(schema_blob, store_name="local")
+
+        assert stats["hash_referenced"] >= 1, f"scan should find the active <blob@> reference; got {stats}"
+
+    def test_scan_finds_active_npy_reference(self, schema_npy):
+        """scan() must report schema_paths_referenced >= 1 for a populated <npy@> column.
+
+        Decoded value type returned by NpyCodec.decode is NpyRef (lazy handle),
+        which does not satisfy `_extract_schema_refs`'s dict check — this test
+        fails before the cursor-based fix in scan_schema_references.
+        """
+        GcNpyTest.insert1({"rid": 1, "waveform": np.arange(64, dtype="float32")})
+
+        stats = gc.scan(schema_npy, store_name="local")
+
+        assert stats["schema_paths_referenced"] >= 1, f"scan should find the active <npy@> reference; got {stats}"
+
+    def test_scan_finds_active_object_reference(self, schema_object):
+        """scan() must report schema_paths_referenced >= 1 for a populated <object@> column.
+
+        Decoded value type returned by ObjectCodec.decode is ObjectRef (lazy
+        handle), which does not satisfy `_extract_schema_refs`'s dict check —
+        this test fails before the cursor-based fix in scan_schema_references.
+        """
+        GcObjectTest.insert1({"rid": 1, "results": b"hello-gc-test"})
+
+        stats = gc.scan(schema_object, store_name="local")
+
+        assert stats["schema_paths_referenced"] >= 1, f"scan should find the active <object@> reference; got {stats}"
