#923
I went through the draft and checked that the external links work. They all
seem to resolve, and in most cases the final URL after redirects is the same as the one in the draft.
This PR:
* Fixes a URL typo
* Adjusts the book reference to use its DOI URL
Questions:
* Most places reference schemas with `http://`, but some use `https://`.
How should schema URLs be referenced in paragraphs and examples? The
links work either way: when followed, the `http://` form redirects to `https://`.
* Some third-party datasets are referenced. They still
exist, but may become dead links in the future. If possible (license
permitting), should they be replicated in the repo or placed somewhere
like a torrent DHT or IPFS?
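To make the first question concrete, here is a minimal Python sketch (not part of this PR) that lists the schema.org URLs found in a piece of Markdown, grouped by scheme, so mixed `http://` and `https://` usage is easy to spot. The function name `group_urls_by_scheme`, the regular expression, and the sample text are illustrative assumptions. The sketch makes no network requests.

```python
import re
from collections import defaultdict

# Matches http:// or https:// URLs, stopping at whitespace or at
# common Markdown/HTML delimiters. This is a heuristic, not a full URL parser.
URL_PATTERN = re.compile(r'(https?)://([^\s<>()"\]`]+)')


def group_urls_by_scheme(text, host_suffix="schema.org"):
    """Return {scheme: sorted URLs} for URLs whose host is host_suffix or a subdomain of it."""
    found = defaultdict(set)
    for scheme, rest in URL_PATTERN.findall(text):
        host = rest.split("/", 1)[0].lower()
        if host == host_suffix or host.endswith("." + host_suffix):
            found[scheme].add(f"{scheme}://{rest}")
    return {scheme: sorted(urls) for scheme, urls in found.items()}


# Hypothetical sample text, loosely modelled on lines from the draft.
sample = (
    '"@vocab": "https://schema.org/"\n'
    "[sc:Intangible](http://schema.org/Intangible)\n"
    "[COCO](https://cocodataset.org/#format-data)\n"
)
result = group_urls_by_scheme(sample)
print(result)
```

Running this over `docs/croissant-spec-draft.md` instead of the sample string would give a list of the places that need to change once a convention is chosen.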
docs/croissant-spec-draft.md
30 additions & 30 deletions
@@ -3,7 +3,7 @@
Version 1.1 (draft)

**This is a draft of the Croissant 1.1 specification. This document is a work in progress.
-For the latest official specification, please see the [Croissant 1.0 specification](https://http://mlcommons.org/croissant/1.0).**
+For the latest official specification, please see the [Croissant 1.0 specification](http://mlcommons.org/croissant/1.0).**

<!-- Published: -->
@@ -55,7 +55,7 @@ Creating or changing the metadata is straightforward. A dataset repository can i
### Responsible AI

-As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess [its economic, social, and environmental impacts](https://link.springer.com/book/10.1007/978-3-030-30371-6). One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.
+As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess [its economic, social, and environmental impacts](https://doi.org/10.1007/978-3-030-30371-6). One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.

This is how Croissant helps address RAI:
@@ -85,7 +85,7 @@ Croissant metadata is encoded in JSON-LD.
{
  "@context": {
    "@language": "en",
-    "@vocab": "https://schema.org/"
+    "@vocab": "http://schema.org/"
  },
  "@type": "sc:Dataset",
  "name": "simple-pass",
@@ -406,7 +406,7 @@ These [schema.org](http://schema.org) properties are recommended for every Crois
<td>The name of the file. As much as possible, the name should reflect the name of the file as downloaded, including the file extension. e.g. "images.zip".</td>
<td>The formats of the file, given as a mime type. Unregistered or niche encoding and file formats can be indicated instead via the most appropriate URL, e.g. defining Web page or a Wikipedia/Wikidata entry.</td>
@@ -1016,7 +1016,7 @@ The ratings `RecordSet` above corresponds to a CSV table, declared elsewhere as
`RecordSet`s specify where to get their data via the `dataSource` property of Field. `DataSource` is the class describing the data that can be extracted from files to populate a `RecordSet`. This class should be used when the data coming from the source needs to be transformed or formatted to be included in the ML dataset; otherwise a simple `Reference` can be used instead to point to the source.

-`DataSource` is a subclassOf: [sc:Intangible](https://schema.org/Intangible) and defines the following properties:
+`DataSource` is a subclassOf: [sc:Intangible](http://schema.org/Intangible) and defines the following properties:

<table>
<thead>
@@ -1191,31 +1191,31 @@ Commonly used atomic data types:
-<td>Describes a Field or a RecordSet whose values are indicative of someone’s gender. This could be used for instance by RAI frameworks and tools to flag possible biases in the data. Values for this RecordSet can be associated with specific gender URLs (eg: <a href="https://www.wikidata.org/wiki/Q6581097">wd:Q6581097</a>, <a href="https://www.wikidata.org/wiki/Q6581072">wd:Q6581072</a>, etc.). Refer to the "Typed RecordSets > Enumerations" section for an example.</td>
+<td>Describes a Field or a RecordSet whose values are indicative of someone’s gender. This could be used for instance by RAI frameworks and tools to flag possible biases in the data. Values for this RecordSet can be associated with specific gender URLs (eg: <a href="http://www.wikidata.org/wiki/Q6581097">wd:Q6581097</a>, <a href="http://www.wikidata.org/wiki/Q6581072">wd:Q6581072</a>, etc.). Refer to the "Typed RecordSets > Enumerations" section for an example.</td>
</tr>
</table>
@@ -1273,13 +1273,13 @@ In the following example, `color_sample` is a field containing an image, but wit
}
```
-In the following example, the `url` field is expected to be a URL, whose semantic type is [City](https://www.wikidata.org/wiki/Q515), so one will expect values of this field to be URLs referring to cities (e.g.: "<https://www.wikidata.org/wiki/Q90>").
+In the following example, the `url` field is expected to be a URL, whose semantic type is [City](http://www.wikidata.org/wiki/Q515), so one will expect values of this field to be URLs referring to cities (e.g.: "<http://www.wikidata.org/wiki/Q90>").
@@ -1582,9 +1582,9 @@ We now introduce a number of features that are useful in the context of ML data.
### Categorical Data

-In machine learning applications, it's often useful to know that some of the data is categorical in nature, and has a finite set of values that can be used, say, for classification. Croissant represents that information by using the [sc:Enumeration](https://schema.org/Enumeration) class from [schema.org](https://schema.org), as a `dataType` on `RecordSet`s that hold categorical data.
+In machine learning applications, it's often useful to know that some of the data is categorical in nature, and has a finite set of values that can be used, say, for classification. Croissant represents that information by using the [sc:Enumeration](http://schema.org/Enumeration) class from [schema.org](http://schema.org), as a `dataType` on `RecordSet`s that hold categorical data.

-These RecordSets must define a `name` field conforming with the [sc:name](https://schema.org/name) definition, i.e. a human-readable text naming the item. They must also specify a key to identify each possible instance. Enumerations should have a `url` field, which can also be used to uniquely refer to each instance.
+These RecordSets must define a `name` field conforming with the [sc:name](http://schema.org/name) definition, i.e. a human-readable text naming the item. They must also specify a key to identify each possible instance. Enumerations should have a `url` field, which can also be used to uniquely refer to each instance.

For example, the [COCO](https://cocodataset.org/#format-data) dataset defines categories and super-categories ([Croissant definition](https://github.com/mlcommons/croissant/blob/main/datasets/1.0/coco2014/metadata.json)), to which are associated other parts of the dataset. Using Croissant, one can describe the COCO super-categories the following way:
@@ -1861,8 +1861,8 @@ Segmentation mask as an image: