dartfx-unf

A high-performance Python implementation of the Universal Numerical Fingerprint (UNF) v6 specification, a format-agnostic standard for data fingerprinting.

Warning

Prototype Status: Full alignment with the UNF v6 specification and the canonical Java Dataverse implementation is still a work in progress. This package should be treated as a prototype for evaluation purposes only and is not for production use at this time.

Overview

dartfx-unf is a fast, memory-efficient calculator for UNF version 6. It keeps your data identifiable and consistent across software versions, file formats, and operating systems by normalizing and hashing the underlying data values rather than the file itself.

Built on the Polars engine, it handles datasets larger than available memory and offers both a CLI and a Python API.

This package was vibe coded with Claude Opus 4.6 and Gemini 3 Flash.

🛡️ Compliance & Interoperability

The primary goal of dartfx-unf is two-fold:

Specification Compliance: strictly follow the UNF v6 specification.
Canonical Alignment: ensure full interoperability with the canonical Java Dataverse implementation.

The Java implementation has produced thousands of persistent fingerprints over the years that cannot be changed. Interoperability is just as critical as compliance.

Handling Ambiguity: In cases where the specification can be interpreted in multiple ways, dartfx-unf provides configuration options to ensure flexibility. While different options can naturally result in different UNFs, the resulting JSON reports document every option used for absolute traceability.

Key Features

✅ Out-of-Core Streaming: Process multi-gigabyte files with constant memory overhead.
✅ Canonical Alignment (WIP): Aims for parity with the Java Dataverse codebase.
📦 Multi-Format: Native support for Parquet, CSV, and statistical formats (SAS, Stata, SPSS).
📋 Structured Reporting: Generates detailed JSON reports documenting all options for full traceability.
🔗 Dataset Hashing: Combine fingerprints from multiple files into a single dataset-level hash.

Installation

We recommend using uv for fast environment management.

This package is not yet published on PyPI. Install from source using the steps below.

For Users

git clone https://github.com/DataArtifex/dartfx-unf.git
cd dartfx-unf

# Option 1: pip (editable install)
pip install -e .

# Option 2: uv (editable install)
uv pip install -e .

For Developers

git clone https://github.com/DataArtifex/dartfx-unf.git
cd dartfx-unf
uv sync

Quick Start

Command Line Interface

Calculate fingerprints directly from your terminal:

# Basic JSON report
uv run dartfx-unf data.parquet

# Disable automatic date parsing for CSVs
uv run dartfx-unf --no-parse-date data.csv

# Quiet mode (just the hash)
uv run dartfx-unf --quiet data.parquet

# Match Dataverse's exact behavior for CSVs (treats empty fields as empty strings)
uv run dartfx-unf --null-as-string --scan-length -1 data.csv

# Detailed summary table
uv run dartfx-unf --verbose file1.csv file2.parquet

Python API

Integrate UNF calculation into your data pipelines:

from dartfx.unf import unf_file

# Calculate and print the hash
report = unf_file("results.parquet")
print(f"UNF: {report.result.unf}")

# Export to validated JSON
json_report = report.to_json(validate=True)

📚 Documentation: Complete documentation is accessible at http://dataartifex.org/docs/dartfx-unf

Why Polars?

To meet the high-performance and "streaming" requirements of modern data science, dartfx-unf leverages Polars:

Vectorized Expressions: Normalization steps map directly to efficient SIMD operations.
Lazy Execution: Optimizes I/O and computation order.
Memory Efficiency: Polars' streaming mode allows us to hash files that are larger than the available RAM.
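
The memory-efficiency point rests on a property of the hash itself: SHA-256 can be fed incrementally, so normalized values never have to sit in memory all at once. The sketch below uses only the standard library and is not the package's Polars pipeline; `batched`, `stream_digest`, and the batch size are illustrative names and choices.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator


def batched(values: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(values)
    while batch := list(islice(iterator, size)):
        yield batch


def stream_digest(normalized_values: Iterable[str], batch_size: int = 1000) -> bytes:
    """Hash already-normalized values batch by batch.

    Each value is terminated with a newline and a null byte, as in the
    normalized forms shown elsewhere in this README.
    """
    hasher = hashlib.sha256()
    for batch in batched(normalized_values, batch_size):
        hasher.update("".join(v + "\n\0" for v in batch).encode("utf-8"))
    return hasher.digest()


# A generator stands in for a file that is read lazily.
values = (f"row-{i}" for i in range(10_000))
streamed = stream_digest(values, batch_size=256)

# Hashing everything in a single call gives the same digest.
all_at_once = hashlib.sha256(
    "".join(f"row-{i}\n\0" for i in range(10_000)).encode("utf-8")
).digest()
print(streamed == all_at_once)
```

Only one batch is materialized at a time, so peak memory depends on the batch size rather than on the number of rows.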

UNF in Practice

The UNF algorithm is format-agnostic and column-order invariant. It ensures that identical dataset values produce the same fingerprint regardless of how they are stored.

Example: Basic Atomic Values

UNF can be used to calculate a fingerprint for a single value (a "vector" of one element). Identical values across different types (e.g., Integer vs Float) or representations (e.g., Date vs String) result in consistent hashes when normalized according to the specification.

| Data Type | Value | Normalized Form (§Ia) | Resulting UNF |
| --- | --- | --- | --- |
| Numeric | `1` / `1.0` | `+1.e+` | `UNF:6:tv3XYCv524AfmlFyVOhuZg==` |
| Numeric | `0` / `-0.0` | `+0.e+` / `-0.e+` | See spec for sign details |
| String | `"A character String"` | `"A character String\n\0"` | `UNF:6:FYqU7uBl885eHMbpco1ooA==` |
| Date | `2014-01-13` | `"2014-01-13\n\0"` | `UNF:6:PH+jFA4u+yJSs1sIw64dyw==` |
| Boolean | `true` | `+1.e+` | `UNF:6:tv3XYCv524AfmlFyVOhuZg==` |
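
The normalized forms in the table can be reproduced with a few lines of standard-library Python. This is a minimal sketch of the v6 steps as understood from the specification (normalize, terminate each value with `\n\0`, hash with SHA-256, keep 128 bits, base64-encode), not the package's implementation. `normalize_number`, `normalize_string`, and `unf_of_vector` are hypothetical helper names, and missing values are not handled.

```python
import base64
import hashlib
import math


def normalize_number(value: float, digits: int = 7) -> str:
    """Render a number in the sign/mantissa/exponent form used by UNF v6."""
    x = float(value)  # bools and ints are treated as numeric
    if math.isnan(x):
        return "+nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    sign = "-" if math.copysign(1.0, x) < 0 else "+"  # keeps the sign of -0.0
    mantissa, exponent = f"{abs(x):.{digits - 1}e}".split("e")
    lead, _, fraction = mantissa.partition(".")
    fraction = fraction.rstrip("0")  # drop trailing zeros
    exp = int(exponent)
    exp_sign = "-" if exp < 0 else "+"
    exp_digits = str(abs(exp)) if exp else ""  # a zero exponent has no digits
    return f"{sign}{lead}.{fraction}e{exp_sign}{exp_digits}"


def normalize_string(value: str, characters: int = 128) -> str:
    """Keep at most `characters` characters of the string."""
    return value[:characters]


def unf_of_vector(normalized: list[str]) -> str:
    """Terminate each value, hash, truncate to 128 bits, and base64-encode."""
    payload = "".join(item + "\n\0" for item in normalized).encode("utf-8")
    digest = hashlib.sha256(payload).digest()[:16]
    return "UNF:6:" + base64.b64encode(digest).decode("ascii")


print(normalize_number(1), normalize_number(1.0), normalize_number(True))
print(normalize_number(0), normalize_number(-0.0))
print(unf_of_vector([normalize_number(1)]) == unf_of_vector([normalize_number(True)]))
print(unf_of_vector([normalize_string("A character String")]))
```

Because `1`, `1.0`, and `true` share one normalized form, they necessarily share one fingerprint, which is why the Numeric and Boolean rows of the table show the same UNF.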

Example: Data & Format Invariance

The UNF remains identical even if we swap the column order or change the storage format (e.g., from CSV to Parquet).

dataset_v1.csv:

id,name,sex,dob,income
1,Alice,F,2007-01-12,75000
2,Bob,M,1985-05-15,160000
3,Charlie,M,1992-08-20,50000

dataset_v2.csv (Columns reordered):

id,income,dob,name,sex
1,75000,2007-01-12,Alice,F
2,160000,1985-05-15,Bob,M
3,50000,1992-08-20,Charlie,M

All variations yield the same fingerprint:

# Original CSV
$ uv run dartfx-unf --quiet dataset_v1.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# CSV with different column order
$ uv run dartfx-unf --quiet dataset_v2.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# Binary Parquet version of the same data
$ uv run dartfx-unf --quiet dataset_v1.parquet
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

💡 Note: If you change the row order, the UNF will change, as the sequence of observations is significant.
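
Column-order invariance and row-order sensitivity both follow from how column fingerprints are combined. The sketch below assumes the combination rule from the v6 specification: the printable column UNFs are sorted, then hashed again as a vector of strings. It uses only the string columns of the example dataset, treats values as already normalized, and the helper names `unf_of_vector` and `dataset_unf` are hypothetical rather than the package's API.

```python
import base64
import hashlib


def unf_of_vector(normalized: list[str]) -> str:
    """Hash a vector of already-normalized values into a printable UNF."""
    payload = "".join(item + "\n\0" for item in normalized).encode("utf-8")
    digest = hashlib.sha256(payload).digest()[:16]  # keep 128 bits
    return "UNF:6:" + base64.b64encode(digest).decode("ascii")


def dataset_unf(columns: dict[str, list[str]]) -> str:
    """Combine per-column UNFs; column names and positions play no role."""
    column_unfs = sorted(unf_of_vector(values) for values in columns.values())
    return unf_of_vector(column_unfs)


original = {
    "name": ["Alice", "Bob", "Charlie"],
    "sex": ["F", "M", "M"],
}
reordered_columns = {"sex": original["sex"], "name": original["name"]}
reordered_rows = {
    "name": ["Bob", "Alice", "Charlie"],
    "sex": ["M", "F", "M"],
}

print(dataset_unf(original) == dataset_unf(reordered_columns))  # True
print(dataset_unf(original) == dataset_unf(reordered_rows))     # False
```

Sorting removes any dependence on column position, while each column hash still depends on the order of its rows.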

Example: Variable (Column) UNFs

You can inspect the fingerprints of individual variables. This is useful for identifying which specific column changed between two versions of a dataset.

$ uv run dartfx-unf --verbose dataset_v1.csv

Output:

COLUMN                         | TYPE         | UNF
--------------------------------------------------------------------------------
id                             | numeric      | UNF:6:AvELPR5QTaBbnq6S22Msow==
name                           | string       | UNF:6:G3RHxSQPXELRGHIJ+FV6qA==
sex                            | string       | UNF:6:VSDSXcRD7ShBmQqv1WR9EA==
dob                            | date         | UNF:6:PH+jFA4u+yJSs1sIw64dyw==
income                         | numeric      | UNF:6:v/5E9kHI79TVvlGYinvxTQ==
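
Comparing two such listings is a plain dictionary diff. The following sketch assumes the per-column UNFs have already been collected into mappings of column name to UNF string; `changed_columns` is a hypothetical helper, and the second version's `income` value is a placeholder, not a real fingerprint.

```python
def changed_columns(old: dict[str, str], new: dict[str, str]) -> dict[str, str]:
    """Classify columns as 'added', 'removed', or 'changed' by comparing UNFs."""
    report = {}
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            report[name] = "added"
        elif name not in new:
            report[name] = "removed"
        elif old[name] != new[name]:
            report[name] = "changed"
    return report


# Column UNFs copied from the verbose output above.
version_1 = {
    "id": "UNF:6:AvELPR5QTaBbnq6S22Msow==",
    "name": "UNF:6:G3RHxSQPXELRGHIJ+FV6qA==",
    "income": "UNF:6:v/5E9kHI79TVvlGYinvxTQ==",
}
# Hypothetical second version: the income value is a placeholder.
version_2 = {**version_1, "income": "UNF:6:(different value)"}

print(changed_columns(version_1, version_2))  # {'income': 'changed'}
```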

Example: Choosing Precision

You can customize the normalization parameters (digits of precision, hash bits, etc.). Changes to these parameters are automatically encoded in the resulting UNF header:

# Standard 7-digit precision (Default)
$ uv run dartfx-unf --quiet dataset_v1.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# Higher 9-digit precision
$ uv run dartfx-unf --quiet --digits 9 dataset_v1.csv
UNF:6:N9:NvK8CwEepCVQZdjiFCGf2A==
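
To illustrate why the precision setting matters and how it shows up in the header, here is a small sketch. It models only the digits parameter, matching the `N9` segment in the output above; `unf_header` and `round_to_digits` are hypothetical names, and other parameters such as hash bits are left out.

```python
def unf_header(digits: int = 7) -> str:
    """Build the UNF prefix; the default precision leaves the header bare."""
    return "UNF:6:" if digits == 7 else f"UNF:6:N{digits}:"


def round_to_digits(value: float, digits: int) -> str:
    """Show the significant digits kept before hashing.

    This is plain scientific notation, not the full UNF normalized form.
    """
    return f"{value:.{digits - 1}e}"


print(unf_header(), unf_header(9))
# UNF:6: UNF:6:N9:
print(round_to_digits(1234.56789, 7), round_to_digits(1234.56789, 9))
# 1.234568e+03 1.23456789e+03
```

Two values that differ only beyond the seventh significant digit collapse to the same normalized form at the default precision and are distinguished at nine digits.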

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines and the Implementation Roadmap for current progress.

License

This project is licensed under the MIT License. See LICENSE.txt for details.
