Is there any content hash/checksum of a version of a symbol #2997

@himsheda

I have two ArcticDB instances that I need to verify are in sync. The data per symbol is large and I don't want to read it entirely just to compare content.
Two questions:

  1. Is there any existing functionality that provides a content hash or fingerprint for a symbol without reading all the data?
    I'm aware of the internal segment-level checksums ArcticDB maintains, but these don't appear to be exposed via the Python API. get_description() returns structural metadata but not a content hash. Is there anything else I'm missing?
  2. If not, would it be possible to expose a stable logical content hash?
    The key requirement is that the hash reflects logical data content, not write history. Two symbols containing identical rows must produce the same hash regardless of the sequence of operations that produced them. For example:
Instance A:  write(d1) → append(d2) → append(d3) → update(patch)
Instance B:  write(full_snapshot)

same logical data → must produce same hash

This rules out hashing at the segment level since segment boundaries differ across write patterns.
A sum of per-row hashes over a stable sort order would satisfy this: it is order-independent, additive, and consistent across write patterns. Something like:

lib.get_content_hash(symbol)         # full symbol
lib.get_content_hash(symbol, date_range=(start, end))   # scoped

would be very useful. Pushing this into the C++ engine would avoid the need to materialise the full DataFrame in Python just to compute a fingerprint.
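
To make the request concrete, here is a minimal client-side sketch of the per-row hash sum described above, written against pandas only. `content_hash` is a hypothetical helper, not an ArcticDB API, and the sample frame is made up for illustration. It assumes both sides hold the same column order and dtypes, since `pandas.util.hash_pandas_object` is sensitive to both. It still has to materialise the data in Python, which is exactly the cost an engine-side implementation would avoid.

```python
import numpy as np
import pandas as pd


def content_hash(df: pd.DataFrame) -> int:
    """Hypothetical helper: sum of per-row hashes modulo 2**64."""
    # hash_pandas_object returns one uint64 per row; index=True folds the
    # index value into each row's hash.
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy(dtype=np.uint64)
    # uint64 addition wraps on overflow, which gives the modulo-2**64 sum.
    return int(row_hashes.sum(dtype=np.uint64))


# Made-up sample data standing in for one symbol.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
full = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], "qty": [10, 20, 30, 40, 50, 60]},
    index=idx,
)

# Instance B: one full snapshot.
hash_b = content_hash(full)

# Instance A: the same rows assembled piecewise (stand-in for write + appends).
assembled = pd.concat([full.iloc[:2], full.iloc[2:4], full.iloc[4:]])
hash_a = content_hash(assembled)

# Additivity: per-chunk hashes combine into the full-symbol hash, so the
# comparison can be done one date range at a time.
chunk_sum = sum(content_hash(full.iloc[i:i + 2]) for i in range(0, 6, 2)) % 2**64

print(hash_a == hash_b, chunk_sum == hash_b)
```

Because the sum is additive, each chunk could come from a scoped read such as `lib.read(symbol, date_range=(start, end)).data`, which keeps memory bounded while comparing the two instances range by range. A plain sum is not collision-resistant in the cryptographic sense, so this is a sync check rather than a tamper check.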
