Commit 30ab514

Make Croissant built-in to Dataverse (no longer an external exporter) (#12130)
* copy post-0.1.6 croissant code from external repo
  This commit, specifically: gdcc/exporter-croissant@a0c3b80
* add spotless config, limit it to croissant for now
* put Croissant in <head> by default, add flag for old behavior #11254
* add release note #11254
* remove generated files
* gitignore files generated by tests
* add croissant to expected export formats #11254
* list new setting and fix typo #11254
* convert feature flag to jvm option #11254
* wire ui:param to backing bean, group related settings #11254
1 parent fea7dcc commit 30ab514

51 files changed

Lines changed: 5463 additions & 19 deletions

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+## Croissant Support Is Now Built In
+
+Croissant is a metadata export format for machine learning datasets that (until this release) was optional and implemented as external exporter. The code has been merged into the main Dataverse code base which means the Croissant format is automatically available in your installation of Dataverse, alongside older formats like Dublin Core and DDI. If you were using the external Croissant exporter, the merged code is equivalent to version 0.1.6. Croissant bugs and feature requests should now be filed against the main Dataverse repo (https://github.com/IQSS/dataverse) and the old repo (https://github.com/gdcc/exporter-croissant) should be considered retired.
+
+As described in the [Discoverability](https://dataverse-guide--12130.org.readthedocs.build/en/12130/admin/discoverability.html#id6) section of the Admin Guide, Croissant is inserted into the "head" of the HTML of dataset landing pages, as requested by the [Google Dataset Search](https://datasetsearch.research.google.com) team so that their tool can filter by datasets that support Croissant. In previous versions of Dataverse, when Croissant was optional and hadn't been enabled, we used the older "Schema.org JSON-LD" format in the "head". If you'd like to keep this behavior, you can use the feature flag [dataverse.legacy.schemaorg-in-html-head](https://dataverse-guide--12130.org.readthedocs.build/en/12130/installation/config.html#dataverse.legacy.schemaorg-in-html-head).
+
+We are aware that the amount of data in the "head" of the HTML can grow quite large for both Croissant and Schema.org JSON-LD. This is especially true of Croissant which exposes variable-level information. We plan to address this in https://github.com/IQSS/dataverse/issues/12123 . We also plan to support Croissant 1.1 in the future and are tracking this at https://github.com/IQSS/dataverse/issues/12014 .
+
+See also #11254 and #12130.
+
+## New Settings
+
+- dataverse.legacy.schemaorg-in-html-head

doc/sphinx-guides/source/admin/discoverability.rst

Lines changed: 9 additions & 8 deletions
@@ -30,21 +30,22 @@ The HTML source of a dataset landing page includes "DC" (Dublin Core) ``<meta>``
 <meta name="DC.type" content="Dataset"
 <meta name="DC.title" content="..."

-.. _schema.org-head:
+.. _croissant-head:

-Schema.org JSON-LD/Croissant Metadata
-+++++++++++++++++++++++++++++++++++++
+Croissant Metadata in the ``<head>`` of Dataset Landing Pages
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-The ``<head>`` of the HTML source of a dataset landing page includes Schema.org JSON-LD metadata like this::
+`Croissant <https://github.com/mlcommons/croissant>`_ is a metadata format for machine learning datasets.

+In Dataverse, the ``<head>`` of the HTML source of a dataset landing page includes Croissant metadata like this::

-<script type="application/ld+json">{"@context":"http://schema.org","@type":"Dataset","@id":"https://doi.org/...
+<script type="application/ld+json">{"@context":..."cr":"http://mlcommons.org/croissant/"...

-If you enable the Croissant metadata export format (see :ref:`external-exporters`) the ``<head>`` will show Croissant metadata instead. It looks similar, but you should see ``"cr": "http://mlcommons.org/croissant/"`` in the output.
+This is the same Croissant file you can download from a dataset landing page by clicking "Metadata" then "Export Metadata" (see :ref:`metadata-export-formats`) and the API (see ``croissant`` at :ref:`export-dataset-metadata-api`).

-For backward compatibility, if you enable Croissant, the older Schema.org JSON-LD format (``schema.org`` in the API) will still be available from both the web interface (see :ref:`metadata-export-formats`) and the API (see :ref:`export-dataset-metadata-api`).
+We include Croissant in the ``<head>`` because it's `recommended <https://github.com/mlcommons/croissant/issues/530#issuecomment-1964227662>`_ by Google for `Google Dataset Search <https://datasetsearch.research.google.com>`_, where they offer a filter to narrow results to only datasets with support for Croissant.

-The Dataverse team has been working with Google on both formats. Google has `indicated <https://github.com/mlcommons/croissant/issues/530#issuecomment-1964227662>`_ that for `Google Dataset Search <https://datasetsearch.research.google.com>`_ (the main reason we started adding this extra metadata in the ``<head>`` of dataset pages), Croissant is the successor to the older format.
+Before Croissant was invented, Google recommended a different format that Dataverse refers to as "Schema.org JSON-LD" in the user interface (and ``schema.org`` in the API). If you prefer to put that older format in the ``<head>``, which was the behavior in older versions of Dataverse, see :ref:`dataverse.legacy.schemaorg-in-html-head`.

 .. _discovery-sign-posting:
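
The documentation change above says the two formats in the ``<head>`` look similar and that Croissant can be recognized by the ``"cr": "http://mlcommons.org/croissant/"`` entry. A minimal sketch of that check follows. It is not code from this commit: the class name, method names, and the sample HTML strings are hypothetical, and the regular expression assumes the script tag is written exactly as in the documentation snippet.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Dataverse: tells Croissant from legacy Schema.org JSON-LD.
class HeadMetadataSniffer {

    // Captures the body of the first JSON-LD <script> element (assumed tag layout).
    private static final Pattern JSON_LD = Pattern.compile(
            "<script\\s+type=\"application/ld\\+json\">(.*?)</script>",
            Pattern.DOTALL);

    /** Returns the body of the first JSON-LD script element, if there is one. */
    public static Optional<String> extractJsonLd(String html) {
        Matcher m = JSON_LD.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Heuristic from the docs: Croissant output declares the "cr" namespace. */
    public static boolean looksLikeCroissant(String jsonLd) {
        // Drop whitespace so both "cr": "..." and "cr":"..." match.
        return jsonLd.replaceAll("\\s+", "")
                .contains("\"cr\":\"http://mlcommons.org/croissant/\"");
    }

    public static void main(String[] args) {
        // Made-up, heavily shortened sample of a landing page <head>.
        String html = "<head><script type=\"application/ld+json\">"
                + "{\"@context\":{\"cr\":\"http://mlcommons.org/croissant/\"}}"
                + "</script></head>";
        boolean croissant = extractJsonLd(html)
                .map(HeadMetadataSniffer::looksLikeCroissant)
                .orElse(false);
        System.out.println("Croissant in head: " + croissant);
    }
}
```

A plain substring test is used instead of a JSON parser to keep the sketch dependency-free; a real client would more likely parse the JSON-LD and inspect ``@context``.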

doc/sphinx-guides/source/api/native-api.rst

Lines changed: 2 additions & 1 deletion
@@ -2006,6 +2006,7 @@ Available Dataset Metadata Exporters

 The following dataset metadata exporters ship with Dataverse:

+- ``croissant``
 - ``Datacite``
 - ``dataverse_json``
 - ``dcterms``
@@ -2034,7 +2035,7 @@ Please note that the ``schema.org`` format has changed in backwards-incompatible

 Both forms are valid according to Google's Structured Data Testing Tool at https://search.google.com/structured-data/testing-tool . Schema.org JSON-LD is an evolving standard that permits a great deal of flexibility. For example, https://schema.org/docs/gs.html#schemaorg_expected indicates that even when objects are expected, it's ok to just use text. As with all metadata export formats, we will try to keep the Schema.org JSON-LD format backward-compatible to make integrations more stable, despite the flexibility that's afforded by the standard.

-The standard has further evolved into a format called Croissant. For details, see :ref:`schema.org-head` in the Admin Guide.
+The standard has further evolved into a format called Croissant. For details, see :ref:`croissant-head` in the Admin Guide.

 The ``schema.org`` format changed after Dataverse 6.4 as well. Previously its content type was "application/json" but now it is "application/ld+json".
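
With ``croissant`` added to the list of built-in exporters, the export can be requested by name through the API. The sketch below only builds the request URL and makes no network call. It assumes the export endpoint takes the form ``/api/datasets/export?exporter=...&persistentId=...``, which is not shown in this diff; the server URL and DOI are placeholders, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Dataverse: builds a metadata export URL.
class ExportUrlSketch {

    /**
     * Assumption: the export endpoint is /api/datasets/export with
     * "exporter" and "persistentId" query parameters.
     */
    public static URI exportUri(String serverUrl, String exporter, String persistentId) {
        String query = "exporter=" + URLEncoder.encode(exporter, StandardCharsets.UTF_8)
                + "&persistentId=" + URLEncoder.encode(persistentId, StandardCharsets.UTF_8);
        return URI.create(serverUrl + "/api/datasets/export?" + query);
    }

    public static void main(String[] args) {
        // Placeholder server and DOI, for illustration only.
        System.out.println(exportUri(
                "https://dataverse.example.edu", "croissant", "doi:10.5072/FK2/EXAMPLE"));
    }
}
```

The persistent identifier is percent-encoded because it contains ``:`` and ``/``; passing ``schema.org`` as the exporter name would request the legacy format instead.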

doc/sphinx-guides/source/developers/coding-style.rst

Lines changed: 14 additions & 1 deletion
@@ -13,6 +13,8 @@ Java
 Formatting Code
 ~~~~~~~~~~~~~~~

+How to format Java code is being discussed on `Zulip <https://dataverse.zulipchat.com/#narrow/channel/379673-dev/topic/code.20formatting.20.28Spotless.2C.20Checkstyle.2C.20etc.2E.29/near/432974039>`_ and the `dev mailing list <https://groups.google.com/g/dataverse-dev/c/y2Jpk3szTf8/m/NhTJvXblAgAJ>`_.
+
 Tabs vs. Spaces
 ^^^^^^^^^^^^^^^

@@ -59,10 +61,21 @@ Place curly braces according to the style below, which is an example you can see
 }
 }

+Format Code with Spotless
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In some of our libraries we've had success formatting code with `Spotless <https://github.com/diffplug/spotless>`_. See https://github.com/gdcc/xoai/issues/35 for an early discussion.
+
+We've added Spotless to the main repo but have limited it to certain files. If you'd like to use Spotless on files you're editing, update the config in pom.xml to include them.
+
+To run Spotless on your code:
+
+``mvn spotless:apply``
+
 Format Code You Changed with Netbeans
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-IQSS has standardized on Netbeans. It is much appreciated when you format your code (but only the code you touched!) using the out-of-the-box Netbeans configuration. If you have created an entirely new Java class, you can just click Source -> Format. If you are adjusting code in an existing class, highlight the code you changed and then click Source -> Format. Keeping the "diff" in your pull requests small makes them easier to code review.
+For a long time IQSS standardized on Netbeans. For files not included in the Spotless config mentioned above, it is much appreciated when you format your code (but only the code you touched!) using the out-of-the-box Netbeans configuration. If you have created an entirely new Java class, you can just click Source -> Format. If you are adjusting code in an existing class, highlight the code you changed and then click Source -> Format. Keeping the "diff" in your pull requests small makes them easier to code review.

 Checking Your Formatting With Checkstyle
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

doc/sphinx-guides/source/installation/advanced.rst

Lines changed: 1 addition & 2 deletions
@@ -136,9 +136,8 @@ Use the :ref:`dataverse.spi.exporters.directory` configuration option to specify
 Inventory of External Exporters
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-For a list of external exporters, see the README at https://github.com/gdcc/dataverse-exporters. To highlight a few:
+For a list of external exporters, see the README at https://github.com/gdcc/dataverse-exporters. For example:

-- Croissant
 - RO-Crate

 Developing New Exporters

doc/sphinx-guides/source/installation/config.rst

Lines changed: 9 additions & 1 deletion
@@ -3851,6 +3851,15 @@ Example: ``dataverse.api.mdc.min-delay-ms=100`` (enforces a minimum 100ms delay

 Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_API_MDC_MIN_DELAY_MS``.

+.. _dataverse.legacy.schemaorg-in-html-head:
+
+dataverse.legacy.schemaorg-in-html-head
++++++++++++++++++++++++++++++++++++++++
+
+Instead of Croissant, use the legacy format (Schema.org JSON-LD) in the head of dataset landing pages by setting ``dataverse.legacy.schemaorg-in-html-head=true``. See :ref:`croissant-head`.
+
+Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_LEGACY_SCHEMAORG_IN_HTML_HEAD``.
+
 .. dataverse.ldn

 Linked Data Notifications (LDN) Allowed Hosts
@@ -4033,7 +4042,6 @@ Only contact DataCite to update a DOI after checking to see if DataCite has outd



-
 .. _:ApplicationServerSettings:

 Application Server Settings
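
The new config entry pairs the property ``dataverse.legacy.schemaorg-in-html-head`` with the environment variable ``DATAVERSE_LEGACY_SCHEMAORG_IN_HTML_HEAD``. The sketch below reproduces that naming using the MicroProfile Config convention of replacing every character that is not a letter, digit, or underscore with ``_`` and then upper-casing the result. It only illustrates the naming rule and does not perform a real config lookup; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper, not part of Dataverse: derives the environment variable name.
class EnvVarNameSketch {

    /** Replace non-alphanumeric, non-underscore characters with '_' and upper-case. */
    public static String toEnvVarName(String propertyName) {
        return propertyName
                .replaceAll("[^A-Za-z0-9_]", "_")
                .toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toEnvVarName("dataverse.legacy.schemaorg-in-html-head"));
        System.out.println(toEnvVarName("dataverse.api.mdc.min-delay-ms"));
    }
}
```

Both property names used in ``main`` appear in the hunk above, and the derived names match the environment variables documented there.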

doc/sphinx-guides/source/user/dataset-management.rst

Lines changed: 2 additions & 2 deletions
@@ -28,6 +28,7 @@ Supported Metadata Export Formats

 Once a dataset has been published, its metadata can be exported in a variety of other metadata standards and formats, which help make datasets more :doc:`discoverable </admin/discoverability>` and usable in other systems, such as other data repositories. On each dataset page's metadata tab, the following exports are available:

+- Croissant
 - Dublin Core
 - DDI (Data Documentation Initiative Codebook 2.5)
 - DDI HTML Codebook (A more human-readable, HTML version of the DDI Codebook 2.5 metadata export)
@@ -37,9 +38,8 @@ Once a dataset has been published, its metadata can be exported in a variety of
 - OpenAIRE
 - Schema.org JSON-LD

-Additional formats can be enabled. See :ref:`inventory-of-external-exporters` in the Installation Guide. To highlight a few:
+Additional formats can be enabled. See :ref:`inventory-of-external-exporters` in the Installation Guide. For example:

-- Croissant
 - RO-Crate

 Each of these metadata exports contains the metadata of the most recently published version of the dataset.

pom.xml

Lines changed: 24 additions & 0 deletions
@@ -1116,6 +1116,30 @@
                     </execution>
                 </executions>
             </plugin>
+            <plugin>
+                <groupId>com.diffplug.spotless</groupId>
+                <artifactId>spotless-maven-plugin</artifactId>
+                <version>3.2.1</version>
+                <configuration>
+                    <java>
+                        <includes>
+                            <include>src/main/java/edu/harvard/iq/dataverse/export/CroissantExporter.java</include>
+                            <include>src/test/java/edu/harvard/iq/dataverse/export/CroissantExporterTest.java</include>
+                        </includes>
+                        <importOrder>
+                            <wildcardsLast>false</wildcardsLast>
+                        </importOrder>
+                        <removeUnusedImports>
+                            <engine>google-java-format</engine>
+                        </removeUnusedImports>
+                        <googleJavaFormat>
+                            <version>1.17.0</version>
+                            <style>AOSP</style>
+                            <reflowLongStrings>true</reflowLongStrings>
+                        </googleJavaFormat>
+                    </java>
+                </configuration>
+            </plugin>
         </plugins>
     </build>
     <profiles>

src/main/java/edu/harvard/iq/dataverse/DatasetPage.java

Lines changed: 4 additions & 0 deletions
@@ -1486,6 +1486,10 @@ public boolean canSeeCurationStatus() {
         }
     }

+    public boolean isUseLegacyFormatInHead() {
+        return JvmSettings.SCHEMAORG_IN_HTML_HEAD.lookupOptional(Boolean.class).orElse(false);
+    }
+
     /*
      * 4.2.1 optimization.
      * HOWEVER, this doesn't appear to be saving us anything!
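
The new ``isUseLegacyFormatInHead()`` method relies on ``lookupOptional(...).orElse(false)``, so an unset option means Croissant is used in the head. A self-contained sketch of that default-to-false pattern follows. It is not the Dataverse implementation: ``JvmSettings`` is replaced by a hypothetical stand-in that reads only Java system properties, and ``Boolean.parseBoolean`` is a simplification of whatever conversion the real config layer applies.

```java
import java.util.Optional;

// Hypothetical stand-in for the JvmSettings lookup; reads system properties only.
class LegacyHeadFlagSketch {

    static final String KEY = "dataverse.legacy.schemaorg-in-html-head";

    /** Empty when the property is unset; otherwise the parsed boolean. */
    public static Optional<Boolean> lookupOptional(String key) {
        return Optional.ofNullable(System.getProperty(key))
                .map(Boolean::parseBoolean); // simplification: only "true" (any case) is true
    }

    /** Mirrors the shape of the new method: an unset flag falls back to false. */
    public static boolean isUseLegacyFormatInHead() {
        return lookupOptional(KEY).orElse(false);
    }

    public static void main(String[] args) {
        System.clearProperty(KEY);
        System.out.println("unset: " + isUseLegacyFormatInHead());
        System.setProperty(KEY, "true");
        System.out.println("true:  " + isUseLegacyFormatInHead());
    }
}
```

Returning an ``Optional`` from the lookup keeps "not configured" distinct from "configured as false", and the caller decides the default in one place.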
