Is there any content hash/checksum of a version of a symbol #2997
Open
Description
I have two ArcticDB instances that I need to verify are in sync. The data per symbol is large and I don't want to read it entirely just to compare content.
Two questions:
- Is there any existing functionality that provides a content hash or fingerprint for a symbol without reading all the data?
  I'm aware of the internal segment-level checksums ArcticDB maintains, but these don't appear to be exposed via the Python API. get_description() returns structural metadata but not a content hash. Is there anything else I'm missing?
- If not, would it be possible to expose a stable logical content hash?
The key requirement is that the hash reflects logical data content, not write history. Two symbols containing identical rows must produce the same hash regardless of the sequence of operations that produced them. For example:
Instance A: write(d1) → append(d2) → append(d3) → update(patch)
Instance B: write(full_snapshot)
same logical data → must produce same hash
This rules out hashing at the segment level since segment boundaries differ across write patterns.
A sum of per-row hashes over a stable sort order would satisfy this: it is order-independent, additive, and consistent across write patterns. Something like:
lib.get_content_hash(symbol) # full symbol
lib.get_content_hash(symbol, date_range=(start, end)) # scoped
would be very useful. Pushing this into the C++ engine would avoid the need to materialise the full DataFrame in Python just to compute a fingerprint.
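To make the proposal concrete, here is a minimal client-side sketch of the "sum of per-row hashes" idea, written against plain pandas rather than ArcticDB. The names content_hash and combine are hypothetical helpers, not part of any existing API, and get_content_hash above does not exist today. The sketch assumes both sides hold the same column names and dtypes, and it still materialises the data, so it only illustrates the semantics an engine-side implementation might have.

```python
import numpy as np
import pandas as pd


def content_hash(df: pd.DataFrame) -> int:
    """Hypothetical helper: order-independent fingerprint of a DataFrame.

    Each row (index value plus column values) is hashed to a uint64,
    and the row hashes are summed modulo 2**64.
    """
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy(dtype=np.uint64)
    # uint64 array addition wraps on overflow, which gives the modulo 2**64 sum.
    return int(row_hashes.sum(dtype=np.uint64))


def combine(*partial_hashes: int) -> int:
    """Hypothetical helper: merge hashes of disjoint row sets (additivity)."""
    return sum(partial_hashes) % 2**64


index = pd.date_range("2024-01-01", periods=6, freq="D")
full = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "qty": [10, 20, 30, 40, 50, 60]},
    index=index,
)

# Instance B: a single full snapshot.
hash_b = content_hash(full)

# Instance A: the same rows arriving as three separate pieces.
d1, d2, d3 = full.iloc[:2], full.iloc[2:4], full.iloc[4:]
hash_a = combine(content_hash(d1), content_hash(d2), content_hash(d3))

print(hash_a == hash_b)
```

Because the per-chunk hashes add up, a scoped hash over a date_range could be computed chunk by chunk without ever holding the whole symbol in memory. One caveat of a plain sum is that it is a fingerprint rather than a cryptographic digest, and a dtype difference (for example int64 versus float64 for the same values) changes the result.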