Croissant validation issues #11462

@pdurbin

@amberleahey put on my radar a Croissant validation tool I didn't know about: https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker

It looks like it was added in mlcommons/croissant#826 and people who are submitting to NeurIPS are asked (at https://neurips.cc/Conferences/2025/DataHostingGuidelines ) to use that "checker" tool to verify their Croissant file before submitting. I'm assuming the production version of the tool can be cloned from https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker?clone=true

We should spend some time with this checker tool and see what it reports.

From a quick test I can confirm that the checker reports errors for this dataset, as Amber found: https://borealisdata.ca/dataset.xhtml?persistentId=doi%3A10.5683%2FSP3%2F3N6JVZ&version=1.0

Further, my own dataset doesn't validate with this tool either: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP&version=4.0

When developing the Croissant exporter ( https://github.com/gdcc/exporter-croissant and docs added in #10533), I used the only validation tool I was aware of at the time. Docs are here: https://github.com/mlcommons/croissant/tree/v1.0.17/python/mlcroissant#verifyload-a-croissant-dataset

When I run the "validate" command on my dataset, it looks fine:

mlcroissant validate --jsonld "https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP"

I0501 15:52:55.617974 8719305856 validate.py:53] Done.
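For anyone scripting these checks across several datasets, the URL passed to `--jsonld` can be built from the installation's base URL and the persistent ID. This is only a minimal sketch: `croissant_export_url` is a hypothetical helper name, and the base URL and DOI are the ones from the command above.

```python
from urllib.parse import urlencode


def croissant_export_url(base_url: str, persistent_id: str) -> str:
    """Build the Croissant export URL for a dataset (hypothetical helper)."""
    # Keep ":" and "/" unescaped so the DOI reads the same as in the command above.
    query = urlencode(
        {"exporter": "croissant", "persistentId": persistent_id}, safe=":/"
    )
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


url = croissant_export_url("https://dataverse.harvard.edu", "doi:10.7910/DVN/TJCLKP")
print(url)
```

The printed URL can then be passed to `mlcroissant validate --jsonld`.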

When I run it against the Borealis dataset above, it shows warnings (but no errors):

mlcroissant validate --jsonld "https://borealisdata.ca/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5683/SP3/3N6JVZ"

W0501 15:54:11.002909 8719305856 datasets.py:41] Found the following 3 warning(s) during the validation:
  -  `content_size` has cardinality `ONE`, but got a list.
  -  `content_url` has cardinality `ONE`, but got a list.
  -  `md5` has cardinality `ONE`, but got a list.
I0501 15:54:11.002985 8719305856 validate.py:53] Done.
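To narrow down where "got a list" comes from, a rough sketch like the following could be run on the exported JSON. It makes two assumptions: that the JSON-LD keys `contentSize`, `contentUrl` and `md5` correspond to the attributes named in the warnings, and that files live under a top-level `distribution` key. It also checks for repeated `@id` values, because JSON-LD treats nodes with the same `@id` as one node, so two entries sharing an `@id` could end up with merged, list-valued properties after parsing. That is one possible explanation and has not been verified here. The function names and the sample are made up.

```python
from collections import Counter

# Assumption: these JSON-LD keys correspond to the content_size, content_url
# and md5 attributes named in the warnings above.
SINGLE_VALUED_KEYS = ("contentSize", "contentUrl", "md5")


def distribution_entries(croissant: dict) -> list:
    """Return the distribution as a list, even if it holds a single object."""
    distribution = croissant.get("distribution", [])
    return [distribution] if isinstance(distribution, dict) else list(distribution)


def list_valued_properties(croissant: dict) -> list:
    """Find (entry @id, key) pairs where a single-valued key holds a list."""
    return [
        (entry.get("@id"), key)
        for entry in distribution_entries(croissant)
        for key in SINGLE_VALUED_KEYS
        if isinstance(entry.get(key), list)
    ]


def duplicate_ids(croissant: dict) -> list:
    """Find @id values shared by more than one distribution entry."""
    counts = Counter(entry.get("@id") for entry in distribution_entries(croissant))
    return sorted(id_ for id_, n in counts.items() if id_ is not None and n > 1)


# Made-up sample, not taken from either dataset in this issue.
sample = {
    "distribution": [
        {"@id": "data.csv", "contentUrl": "https://example.org/files/1", "md5": "aaa"},
        {"@id": "data.csv", "contentUrl": "https://example.org/files/2", "md5": "bbb"},
        {"@id": "readme.txt", "md5": ["ccc", "ddd"]},
    ]
}

print(list_valued_properties(sample))
print(duplicate_ids(sample))
```

If the first check finds nothing in the raw export but the second does, that would point toward the lists appearing only after the JSON-LD is parsed.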

I took a quick look at the code but I'm not sure why those warnings are happening. I can also reproduce them with a Kaggle dataset I found at mlcommons/croissant#837 (heads up @goeffthomas):

mlcroissant validate --jsonld https://www.kaggle.com/datasets/heptapod/titanic/croissant/download

W0501 15:57:03.731079 8719305856 rdf.py:80] WARNING: The JSON-LD `@context` is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'isLiveDataset', 'rai', 'examples'}
W0501 15:57:03.765449 8719305856 datasets.py:41] Found the following 2 warning(s) during the validation:
  -  [Metadata(Titanic)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  `source` has cardinality `ONE`, but got a list.
I0501 15:57:03.765532 8719305856 validate.py:53] Done.
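The first warning lists the `@context` keys that differ from the standard context. A similar comparison can be sketched in plain Python, for example to check an exporter's `@context` against one of the example datasets. This is an illustration only: `context_key_differences` is a hypothetical name, treating a key as "different" when it is missing, extra, or mapped to another value is an assumption (mlcroissant may compute it differently), and the two contexts below are tiny made-up stand-ins rather than the real Croissant context.

```python
_MISSING = object()  # sentinel so a missing key never equals a real value


def context_key_differences(context: dict, reference: dict) -> set:
    """Return keys that are missing, extra, or mapped to a different value."""
    return {
        key
        for key in set(context) | set(reference)
        if context.get(key, _MISSING) != reference.get(key, _MISSING)
    }


# Made-up stand-ins for a reference @context and an exported @context.
reference_context = {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}
exported_context = {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "extraKey": "https://example.org/extra",  # hypothetical extra entry
}

print(sorted(context_key_differences(exported_context, reference_context)))
```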

While I think we should investigate both the errors from the checker and the warnings from the validator, we should probably focus a bit more on the checker, since NeurIPS is coming up and Harvard Dataverse is listed first in the submission guidelines at https://neurips.cc/Conferences/2025/DataHostingGuidelines. (We worked with @mrisdal on this, and because of it we are working on support for exporting Croissant and other formats while the dataset is still in draft, in #11398.)

Finally, Amber alerted me to a git repo at https://github.com/kenlhlui/dv-croissant-testing where @kenlhlui has done some analysis of native JSON vs Croissant exports and mismatches between both @id values and contentUrl values. I haven't absorbed this yet but it may provide clues about the problems with the checker.

Labels: Croissant (Croissant and Kaggle related work); FY25 Sprint 23 (2025-05-07 - 2025-05-21); FY25 Sprint 24 (2025-05-21 - 2025-06-04); FY25 Sprint 25 (2025-06-04 - 2025-06-18); FY25 Sprint 26 (2025-06-18 - 2025-07-02); FY26 Sprint 1 (2025-07-02 - 2025-07-16); Size: 10 (A percentage of a sprint. 7 hours.); Type: Bug (a defect)

Status: Done 🧹
