Commit 306fff2

1.1 chore: Verify and fix link urls (#929)
#923

I went through the draft and made sure external links worked. Everything seems to work, with the final urls being the same most of the time.

This PR:
* Fix a URL typo
* Adjust book reference to use DOI url

Questions:
* Most places reference schemas with `http://`, but some use `https://`. How should schema urls be referenced in paragraphs and examples? The links work either way, when followed the http redirects to https.
* There are some 3rd party datasets that are referenced. They still exist, but may become dead links in the future. If possible (license permitting) should they be replicated in the repo or placed somewhere like torrent DHT or ipfs?
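
The `https://` to `http://` rewrite that makes up most of this diff could be scripted rather than done by hand. The commit does not say how the edits were made, so the following is only a minimal sketch: the function name `normalize_vocab_urls` and the host list are hypothetical, chosen to match the hosts that appear in the hunks below. The choice of scheme is not purely cosmetic inside JSON-LD contexts, because IRIs are compared as plain strings there, so `http://schema.org/name` and `https://schema.org/name` are distinct identifiers even though both URLs resolve.

```python
import re

# Hosts whose vocabulary URLs get rewritten from https:// to http://.
# Assumption: this list is inferred from the hunks in this diff, not from the commit itself.
VOCAB_HOSTS = ("schema.org", "www.wikidata.org")

# Match "https://<host>" only when the host name really ends there, so that
# look-alike hosts such as "schema.org.example.com" are left alone.
_PATTERN = re.compile(
    r"https://(" + "|".join(re.escape(h) for h in VOCAB_HOSTS) + r")(?![\w-]|\.\w)"
)


def normalize_vocab_urls(text: str) -> str:
    """Return text with https:// replaced by http:// for the listed hosts only."""
    return _PATTERN.sub(r"http://\1", text)


if __name__ == "__main__":
    sample = '"@vocab": "https://schema.org/", see https://cocodataset.org/#format-data'
    print(normalize_vocab_urls(sample))
```

Links to other sites (the COCO dataset page, the DOI link) are left untouched, which matches what the diff does. Whether `http://` is the right canonical form is still the open question raised above.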
1 parent d1fe816 commit 306fff2

1 file changed

Lines changed: 30 additions & 30 deletions

docs/croissant-spec-draft.md

@@ -3,7 +3,7 @@
 Version 1.1 (draft)
 
 **This is a draft of the Croissant 1.1 specification. This document is a work in progress.
-For the latest official specification, please see the [Croissant 1.0 specification](https://http://mlcommons.org/croissant/1.0).**
+For the latest official specification, please see the [Croissant 1.0 specification](http://mlcommons.org/croissant/1.0).**
 
 <!-- Published: -->
 
@@ -55,7 +55,7 @@ Creating or changing the metadata is straightforward. A dataset repository can i
 
 ### Responsible AI
 
-As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess [its economic, social, and environmental impacts](https://link.springer.com/book/10.1007/978-3-030-30371-6). One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.
+As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess [its economic, social, and environmental impacts](https://doi.org/10.1007/978-3-030-30371-6). One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.
 
 This is how Croissant helps address RAI:
 
@@ -85,7 +85,7 @@ Croissant metadata is encoded in JSON-LD.
 {
 "@context": {
 "@language": "en",
-"@vocab": "https://schema.org/"
+"@vocab": "http://schema.org/"
 },
 "@type": "sc:Dataset",
 "name": "simple-pass",
@@ -406,7 +406,7 @@ These [schema.org](http://schema.org) properties are recommended for every Crois
 <tr>
 <td><a href="http://schema.org/keywords">keywords</a></td>
 <td>
-<a href="https://schema.org/DefinedTerm">DefinedTerm</a><br>
+<a href="http://schema.org/DefinedTerm">DefinedTerm</a><br>
 <a href="http://schema.org/Text">Text</a><br>
 <a href="http://schema.org/URL">URL</a>
 </td>
@@ -467,8 +467,8 @@ These [schema.org](http://schema.org) properties are recommended for every Crois
 <tr>
 <td><a href="http://schema.org/inLanguage">inLanguage</a></td>
 <td>
-<a href="https://schema.org/Language">Language</a><br>
-<a href="https://schema.org/Text">Text</a>
+<a href="http://schema.org/Language">Language</a><br>
+<a href="http://schema.org/Text">Text</a>
 </td>
 <td>MANY</td>
 <td>The language(s) of the content of the dataset.</td>
@@ -609,37 +609,37 @@ Most of the important properties needed to describe a `FileObject` are defined i
 <th>Description</th>
 </thead>
 <tr>
-<td><a href="https://schema.org/name">sc:name</a></td>
+<td><a href="http://schema.org/name">sc:name</a></td>
 <td><a href="http://schema.org/Text">Text</a></td>
 <td>ONE</td>
 <td>The name of the file. As much as possible, the name should reflect the name of the file as downloaded, including the file extension. e.g. "images.zip".</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/contentUrl">sc:contentUrl</a></td>
+<td><a href="http://schema.org/contentUrl">sc:contentUrl</a></td>
 <td><a href="http://schema.org/URL">URL</a></td>
 <td>ONE</td>
 <td>Actual bytes of the media object, for example the image file or video file.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/contentSize">sc:contentSize</a></td>
+<td><a href="http://schema.org/contentSize">sc:contentSize</a></td>
 <td><a href="http://schema.org/Text">Text</a></td>
 <td>ONE</td>
 <td>File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/encodingFormat">sc:encodingFormat</a></td>
+<td><a href="http://schema.org/encodingFormat">sc:encodingFormat</a></td>
 <td><a href="http://schema.org/Text">Text</a></td>
 <td>MANY</td>
 <td>The formats of the file, given as a mime type. Unregistered or niche encoding and file formats can be indicated instead via the most appropriate URL, e.g. defining Web page or a Wikipedia/Wikidata entry.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/sameAs">sc:sameAs</a></td>
+<td><a href="http://schema.org/sameAs">sc:sameAs</a></td>
 <td><a href="http://schema.org/URL">URL</a></td>
 <td>MANY</td>
 <td>URL (or local name) of a FileObject with the same content, but in a different format.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/sha256">sc:sha256</a></td>
+<td><a href="http://schema.org/sha256">sc:sha256</a></td>
 <td><a href="http://schema.org/Text">Text</a></td>
 <td>ONE</td>
 <td>Checksum for the file contents.</td>
@@ -1016,7 +1016,7 @@ The ratings `RecordSet` above corresponds to a CSV table, declared elsewhere as
 
 `RecordSet`s specify where to get their data via the `dataSource` property of Field. `DataSource` is the class describing the data that can be extracted from files to populate a `RecordSet`. This class should be used when the data coming from the source needs to be transformed or formatted to be included in the ML dataset; otherwise a simple `Reference` can be used instead to point to the source.
 
-`DataSource` is a subclassOf: [sc:Intangible](https://schema.org/Intangible) and defines the following properties:
+`DataSource` is a subclassOf: [sc:Intangible](http://schema.org/Intangible) and defines the following properties:
 
 <table>
 <thead>
@@ -1191,31 +1191,31 @@ Commonly used atomic data types:
 <th>Usage</th>
 </thead>
 <tr>
-<td><a href="https://schema.org/Boolean">sc:Boolean</a></td>
+<td><a href="http://schema.org/Boolean">sc:Boolean</a></td>
 <td>Describes a boolean.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/Date">sc:Date</a></td>
+<td><a href="http://schema.org/Date">sc:Date</a></td>
 <td>Describes a date.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/Time">sc:Time</a></td>
+<td><a href="http://schema.org/Time">sc:Time</a></td>
 <td>Describes a time.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/DateTime">sc:DateTime</a></td>
+<td><a href="http://schema.org/DateTime">sc:DateTime</a></td>
 <td>Describes a combination of date and time of day.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/Float">sc:Float</a></td>
+<td><a href="http://schema.org/Float">sc:Float</a></td>
 <td>Describes a float.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/Integer">sc:Integer</a></td>
+<td><a href="http://schema.org/Integer">sc:Integer</a></td>
 <td>Describes an integer.</td>
 </tr>
 <tr>
-<td><a href="https://schema.org/Text">sc:Text</a></td>
+<td><a href="http://schema.org/Text">sc:Text</a></td>
 <td>Describes a string.</td>
 </tr>
 </table>
@@ -1228,7 +1228,7 @@ Other data types commonly used in ML datasets:
 <th>Usage</th>
 </thead>
 <tr>
-<td><a href="https://schema.org/ImageObject">sc:ImageObject</a></td>
+<td><a href="http://schema.org/ImageObject">sc:ImageObject</a></td>
 <td>Describes a field containing the content of an image (pixels).</td>
 </tr>
 <tr>
@@ -1256,10 +1256,10 @@ Croissant datasets can use data types from other vocabularies, such as Wikidata.
 </thead>
 <tr>
 <td>
-<a href="https://www.wikidata.org/wiki/Q48277">wd:Q48277</a><br>
+<a href="http://www.wikidata.org/wiki/Q48277">wd:Q48277</a><br>
 (gender)
 </td>
-<td>Describes a Field or a RecordSet whose values are indicative of someone’s gender. This could be used for instance by RAI frameworks and tools to flag possible biases in the data. Values for this RecordSet can be associated with specific gender URLs (eg: <a href="https://www.wikidata.org/wiki/Q6581097">wd:Q6581097</a>, <a href="https://www.wikidata.org/wiki/Q6581072">wd:Q6581072</a>, etc.). Refer to the "Typed RecordSets > Enumerations" section for an example.</td>
+<td>Describes a Field or a RecordSet whose values are indicative of someone’s gender. This could be used for instance by RAI frameworks and tools to flag possible biases in the data. Values for this RecordSet can be associated with specific gender URLs (eg: <a href="http://www.wikidata.org/wiki/Q6581097">wd:Q6581097</a>, <a href="http://www.wikidata.org/wiki/Q6581072">wd:Q6581072</a>, etc.). Refer to the "Typed RecordSets > Enumerations" section for an example.</td>
 </tr>
 </table>
 
@@ -1273,13 +1273,13 @@ In the following example, `color_sample` is a field containing an image, but wit
 }
 ```
 
-In the following example, the `url` field is expected to be a URL, whose semantic type is [City](https://www.wikidata.org/wiki/Q515), so one will expect values of this field to be URLs referring to cities (e.g.: "<https://www.wikidata.org/wiki/Q90>").
+In the following example, the `url` field is expected to be a URL, whose semantic type is [City](http://www.wikidata.org/wiki/Q515), so one will expect values of this field to be URLs referring to cities (e.g.: "<http://www.wikidata.org/wiki/Q90>").
 
 ```json
 {
 "@id": "cities/url",
 "@type": "cr:Field",
-"dataType": ["https://schema.org/URL", "https://www.wikidata.org/wiki/Q515"]
+"dataType": ["http://schema.org/URL", "http://www.wikidata.org/wiki/Q515"]
 }
 ```
 
@@ -1527,7 +1527,7 @@ Annotations can also appear at the level of a RecordSet. A RecordSet-level annot
 ],
 "annotation" : {
 "@type": "cr:Field", "@id": "movies/ratings",
-subField: [
+"subField": [
 { "@type": "cr:Field", "@id": "movies/ratings/user_id", ...},
 { "@type": "cr:Field", "@id": "movies/ratings/rating", ...},
 ]
@@ -1582,9 +1582,9 @@ We now introduce a number of features that are useful in the context of ML data.
 
 ### Categorical Data
 
-In machine learning applications, it's often useful to know that some of the data is categorical in nature, and has a finite set of values that can be used, say, for classification. Croissant represents that information by using the [sc:Enumeration](https://schema.org/Enumeration) class from [schema.org](https://schema.org), as a `dataType` on `RecordSet`s that hold categorical data.
+In machine learning applications, it's often useful to know that some of the data is categorical in nature, and has a finite set of values that can be used, say, for classification. Croissant represents that information by using the [sc:Enumeration](http://schema.org/Enumeration) class from [schema.org](http://schema.org), as a `dataType` on `RecordSet`s that hold categorical data.
 
-These RecordSets must define a `name` field conforming with the [sc:name](https://schema.org/name) definition, i.e. a human-readable text naming the item. They must also specify a key to identify each possible instance. Enumerations should have a `url` field, which can also be used to uniquely refer to each instance.
+These RecordSets must define a `name` field conforming with the [sc:name](http://schema.org/name) definition, i.e. a human-readable text naming the item. They must also specify a key to identify each possible instance. Enumerations should have a `url` field, which can also be used to uniquely refer to each instance.
 
 For example, the [COCO](https://cocodataset.org/#format-data) dataset defines categories and super-categories ([Croissant definition](https://github.com/mlcommons/croissant/blob/main/datasets/1.0/coco2014/metadata.json)), to which are associated other parts of the dataset. Using Croissant, one can describe the COCO super-categories the following way:
 
@@ -1861,8 +1861,8 @@ Segmentation mask as an image:
 ```json
 "@context": {
 "@language": "en",
-"@vocab": "https://schema.org/",
-"sc": "https://schema.org/",
+"@vocab": "http://schema.org/",
+"sc": "http://schema.org/",
 "cr": "http://mlcommons.org/croissant/",
 "rai": "http://mlcommons.org/croissant/RAI/",
 "dct": "http://purl.org/dc/terms/",
