Croissant validation issues

@amberleahey put on my radar a Croissant validation tool I didn't know about: https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker

It looks like it was added in https://github.com/mlcommons/croissant/pull/826 and people who are submitting to NeurIPS are asked (at https://neurips.cc/Conferences/2025/DataHostingGuidelines ) to use that "checker" tool to verify their Croissant file before submitting. I'm assuming the production version of the tool can be cloned from https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker?clone=true

We should spend some time with this checker tool and see what it reports.

From a quick test I can confirm that I see errors reported from this dataset, as reported by Amber: https://borealisdata.ca/dataset.xhtml?persistentId=doi%3A10.5683%2FSP3%2F3N6JVZ&version=1.0

Further, my own dataset doesn't validate with this tool either: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP&version=4.0

When developing the Croissant exporter ( https://github.com/gdcc/exporter-croissant and docs added in #10533), I used the only validation tool I was aware of at the time. Docs are here: https://github.com/mlcommons/croissant/tree/v1.0.17/python/mlcroissant#verifyload-a-croissant-dataset

When I run the "validate" command on my dataset, it looks fine:

`mlcroissant validate --jsonld "https://dataverse.harvard.edu/api/datasets/export?exporter=croissant&persistentId=doi:10.7910/DVN/TJCLKP"`

```
I0501 15:52:55.617974 8719305856 validate.py:53] Done.
```

When I run it against the Borealis dataset above, it shows warnings (but no errors):

`mlcroissant validate --jsonld "https://borealisdata.ca/api/datasets/export?exporter=croissant&persistentId=doi%3A10.5683/SP3/3N6JVZ"`

```
W0501 15:54:11.002909 8719305856 datasets.py:41] Found the following 3 warning(s) during the validation:
  -  `content_size` has cardinality `ONE`, but got a list.
  -  `content_url` has cardinality `ONE`, but got a list.
  -  `md5` has cardinality `ONE`, but got a list.
I0501 15:54:11.002985 8719305856 validate.py:53] Done.
```
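One way to narrow these down might be to check whether the exported JSON-LD literally contains lists for those properties. Below is a minimal sketch, assuming the warnings refer to the `contentSize`, `contentUrl` and `md5` keys of entries under `distribution` (that key mapping is my assumption, based on the Croissant 1.0 examples). The `find_list_valued` helper and the sample document are hypothetical and not taken from either export above.

```python
# Keys assumed to correspond to the `content_size`, `content_url` and `md5`
# warnings; the camelCase JSON-LD names are an assumption.
SINGLE_VALUED = ("contentSize", "contentUrl", "md5")


def find_list_valued(croissant, keys=SINGLE_VALUED):
    """Return (entry @id, key, value) for each distribution key holding a list."""
    hits = []
    distribution = croissant.get("distribution", [])
    # A single distribution entry may be serialized as a bare object.
    if isinstance(distribution, dict):
        distribution = [distribution]
    for entry in distribution:
        for key in keys:
            value = entry.get(key)
            if isinstance(value, list):
                hits.append((entry.get("@id"), key, value))
    return hits


# Hypothetical, hand-written sample (not a real export).
sample = {
    "distribution": [
        {
            "@id": "data.csv",
            "contentUrl": ["https://example.org/a", "https://example.org/b"],
            "md5": ["aaa", "bbb"],
            "contentSize": "10 B",
        },
        {
            "@id": "readme.txt",
            "contentUrl": "https://example.org/c",
            "md5": "ccc",
            "contentSize": "5 B",
        },
    ]
}

for entry_id, key, value in find_list_valued(sample):
    print(f"{entry_id}: {key} is a list with {len(value)} values")
```

If this finds nothing in a real export, the lists are presumably being produced somewhere after parsing rather than being present in the file itself, which would be worth knowing before digging further into the validator code.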

I took a quick look at the code but I'm not sure why those warnings are happening. I can also reproduce them with a Kaggle dataset I found at https://github.com/mlcommons/croissant/issues/837 (heads up @goeffthomas):

`mlcroissant validate --jsonld https://www.kaggle.com/datasets/heptapod/titanic/croissant/download`

```
W0501 15:57:03.731079 8719305856 rdf.py:80] WARNING: The JSON-LD `@context` is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'isLiveDataset', 'rai', 'examples'}
W0501 15:57:03.765449 8719305856 datasets.py:41] Found the following 2 warning(s) during the validation:
  -  [Metadata(Titanic)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  `source` has cardinality `ONE`, but got a list.
I0501 15:57:03.765532 8719305856 validate.py:53] Done.
```

While I think we should investigate both the errors from the checker and the warnings from the validator, we should probably focus more on the checker, since NeurIPS is coming up and Harvard Dataverse is listed first in the submission guidelines at https://neurips.cc/Conferences/2025/DataHostingGuidelines (we worked with @mrisdal on this, and because of it we are working on support for exporting Croissant and other formats while the dataset is still in draft in #11398).

Finally, Amber alerted me to a git repo at https://github.com/kenlhlui/dv-croissant-testing where @kenlhlui has done some analysis of native JSON vs Croissant exports and mismatches between both `@id` values and `contentUrl` values. I haven't absorbed this yet but it may provide clues about the problems with the checker.
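As a starting point for looking at `@id` and `contentUrl` values in a Croissant export, something like the following could list the pairs and flag any `@id` that appears more than once. This is only a sketch under my own assumptions: it is not the method used in that repo, the helper names are hypothetical, and whether repeated `@id` values are related to the checker errors or the validator warnings is an open question rather than something I have confirmed.

```python
from collections import Counter


def distribution_ids(croissant):
    """Return (@id, contentUrl) pairs for every distribution entry."""
    entries = croissant.get("distribution", [])
    # A single distribution entry may be serialized as a bare object.
    if isinstance(entries, dict):
        entries = [entries]
    return [(entry.get("@id"), entry.get("contentUrl")) for entry in entries]


def duplicated_ids(croissant):
    """Return the sorted @id values shared by more than one distribution entry."""
    counts = Counter(
        entry_id for entry_id, _ in distribution_ids(croissant) if entry_id is not None
    )
    return sorted(entry_id for entry_id, n in counts.items() if n > 1)


# Hypothetical, hand-written sample (not a real export).
sample = {
    "distribution": [
        {"@id": "data.csv", "contentUrl": "https://example.org/file/1"},
        {"@id": "data.csv", "contentUrl": "https://example.org/file/2"},
        {"@id": "notes.txt", "contentUrl": "https://example.org/file/3"},
    ]
}

print(distribution_ids(sample))
print(duplicated_ids(sample))
```

The same pairs could then be compared by hand against the file entries in the native JSON export for the same dataset.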

Croissant validation issues #11462
