@amberleahey put on my radar a Croissant validation tool I didn't know about: https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker
It looks like it was added in mlcommons/croissant#826 and people who are submitting to NeurIPS are asked (at https://neurips.cc/Conferences/2025/DataHostingGuidelines ) to use that "checker" tool to verify their Croissant file before submitting. I'm assuming the production version of the tool can be cloned from https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker?clone=true
We should spend some time with this checker tool and see what it reports.
From a quick test I can confirm the errors Amber reported for this dataset: https://borealisdata.ca/dataset.xhtml?persistentId=doi%3A10.5683%2FSP3%2F3N6JVZ&version=1.0
Further, my own dataset doesn't validate with this tool either: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP&version=4.0
When developing the Croissant exporter (https://github.com/gdcc/exporter-croissant; docs added in #10533), I used the only validation tool I was aware of at the time. Docs are here: https://github.com/mlcommons/croissant/tree/v1.0.17/python/mlcroissant#verifyload-a-croissant-dataset
When I run the "validate" command on my dataset, it looks fine:
```
mlcroissant validate --jsonld "https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP"
I0501 15:52:55.617974 8719305856 validate.py:53] Done.
```
When I run it against the Borealis dataset above, it shows warnings (but no errors):
```
mlcroissant validate --jsonld "https://borealisdata.ca/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5683/SP3/3N6JVZ"
W0501 15:54:11.002909 8719305856 datasets.py:41] Found the following 3 warning(s) during the validation:
  -  `content_size` has cardinality `ONE`, but got a list.
  -  `content_url` has cardinality `ONE`, but got a list.
  -  `md5` has cardinality `ONE`, but got a list.
I0501 15:54:11.002985 8719305856 validate.py:53] Done.
```
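For context, here's a minimal stdlib-only sketch of the kind of cardinality check that seems to be behind those warnings (this is illustrative, not mlcroissant's actual implementation): a property declared with cardinality `ONE`, such as `contentUrl`, gets flagged whenever the JSON-LD supplies a list instead of a single value.

```python
# Sketch (not mlcroissant's actual code) of a cardinality-ONE check:
# properties like contentUrl, contentSize, and md5 are expected to hold
# a single value, so a JSON list for any of them triggers a warning.
SINGLE_VALUED = {"contentUrl", "contentSize", "md5"}

def cardinality_warnings(file_object: dict) -> list[str]:
    """Return a warning string for each single-valued property holding a list."""
    warnings = []
    for prop, value in file_object.items():
        if prop in SINGLE_VALUED and isinstance(value, list):
            warnings.append(f"`{prop}` has cardinality `ONE`, but got a list.")
    return warnings

# A FileObject whose contentUrl is (incorrectly) a one-element list:
file_object = {
    "@type": "cr:FileObject",
    "name": "data.tab",
    "contentUrl": ["https://example.org/data.tab"],  # list -> warning
    "md5": "d41d8cd98f00b204e9800998ecf8427e",       # single value -> OK
}
print(cardinality_warnings(file_object))
```

If that's roughly what's happening, the fix on the exporter side would be emitting these properties as scalars even when there's only one value, rather than wrapping them in arrays.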
I took a quick look at the code but I'm not sure why those warnings are happening. I can also reproduce them with a Kaggle dataset I found at mlcommons/croissant#837 (heads up @goeffthomas):
```
mlcroissant validate --jsonld https://www.kaggle.com/datasets/heptapod/titanic/croissant/download
W0501 15:57:03.731079 8719305856 rdf.py:80] WARNING: The JSON-LD `@context` is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'isLiveDataset', 'rai', 'examples'}
W0501 15:57:03.765449 8719305856 datasets.py:41] Found the following 2 warning(s) during the validation:
  -  [Metadata(Titanic)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  `source` has cardinality `ONE`, but got a list.
I0501 15:57:03.765532 8719305856 validate.py:53] Done.
```
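The non-standard `@context` warning looks like a straightforward key-set comparison against the reference context. A stdlib sketch of that idea (the reference keys below are illustrative placeholders, not the official context):

```python
# Sketch (illustrative, not mlcroissant's actual code): flag @context keys
# that differ from a reference @context by comparing key sets.
def context_diff(context: dict, reference: dict) -> set[str]:
    """Return keys present in one @context but not the other (symmetric diff)."""
    return set(context) ^ set(reference)

# Toy contexts: the dataset's @context adds three keys the reference lacks,
# mirroring the {'isLiveDataset', 'rai', 'examples'} set in the warning above.
reference = {"@vocab": "https://schema.org/",
             "cr": "http://mlcommons.org/croissant/"}
dataset_context = {**reference,
                   "isLiveDataset": "cr:isLiveDataset",
                   "rai": "http://mlcommons.org/croissant/RAI/",
                   "examples": {"@id": "cr:examples"}}
print(sorted(context_diff(dataset_context, reference)))
# -> ['examples', 'isLiveDataset', 'rai']
```

If so, that particular warning is about extra (or missing) context keys rather than invalid data, which may explain why Kaggle's export still "passes" with warnings only.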
While I think we should investigate both the errors from the checker and the warnings from the validator, we should probably focus a bit more on the checker since NeurIPS is coming up and Harvard Dataverse is listed first in the submission guidelines at https://neurips.cc/Conferences/2025/DataHostingGuidelines (we worked with @mrisdal on this, and because of it we are working on support for exporting Croissant and other formats while the dataset is still in draft in #11398).
Finally, Amber alerted me to a git repo at https://github.com/kenlhlui/dv-croissant-testing where @kenlhlui has done some analysis of native JSON vs Croissant exports and mismatches between both `@id` values and `contentUrl` values. I haven't absorbed this yet but it may provide clues about the problems with the checker.