diff --git a/docs/croissant-spec-draft.md b/docs/croissant-spec-draft.md
index 887c69b9c..26c92f5c6 100644
--- a/docs/croissant-spec-draft.md
+++ b/docs/croissant-spec-draft.md
@@ -3,7 +3,7 @@ Version 1.1 (draft)
**This is a draft of the Croissant 1.1 specification. This document is a work in progress.
- For the latest official specification, please see the [Croissant 1.0 specification](https://http://mlcommons.org/croissant/1.0).**
+ For the latest official specification, please see the [Croissant 1.0 specification](http://mlcommons.org/croissant/1.0).**
@@ -54,7 +54,7 @@ Creating or changing the metadata is straightforward. A dataset repository can i
### Responsible AI
-As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess [its economic, social, and environmental impacts](https://link.springer.com/book/10.1007/978-3-030-30371-6). One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.
+As AI advances at rapid speed there is increased recognition among researchers, practitioners and policy makers that we need to explore, understand, manage, and assess [its economic, social, and environmental impacts](https://doi.org/10.1007/978-3-030-30371-6). One of the main instruments to operationalise responsible AI (RAI) is dataset documentation.
This is how Croissant helps address RAI:
@@ -84,7 +84,7 @@ Croissant metadata is encoded in JSON-LD.
{
"@context": {
"@language": "en",
- "@vocab": "https://schema.org/"
+ "@vocab": "http://schema.org/"
},
"@type": "sc:Dataset",
"name": "simple-pass",
@@ -405,7 +405,7 @@ These [schema.org](http://schema.org) properties are recommended for every Crois
keywords
- DefinedTerm
+ DefinedTerm
Text
URL
@@ -466,8 +466,8 @@ These [schema.org](http://schema.org) properties are recommended for every Crois
inLanguage
- Language
- Text
+ Language
+ Text
MANY The language(s) of the content of the dataset.
@@ -608,37 +608,37 @@ Most of the important properties needed to describe a `FileObject` are defined i
Description
- sc:name
+ sc:name
Text ONE The name of the file. As much as possible, the name should reflect the name of the file as downloaded, including the file extension. e.g. "images.zip".
- sc:contentUrl
+ sc:contentUrl
URL ONE Actual bytes of the media object, for example the image file or video file.
- sc:contentSize
+ sc:contentSize
Text ONE File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified.
- sc:encodingFormat
+ sc:encodingFormat
Text MANY The formats of the file, given as a mime type. Unregistered or niche encoding and file formats can be indicated instead via the most appropriate URL, e.g. defining Web page or a Wikipedia/Wikidata entry.
- sc:sameAs
+ sc:sameAs
URL MANY URL (or local name) of a FileObject with the same content, but in a different format.
- sc:sha256
+ sc:sha256
Text ONE Checksum for the file contents.
@@ -1015,7 +1015,7 @@ The ratings `RecordSet` above corresponds to a CSV table, declared elsewhere as 
`RecordSet`s specify where to get their data via the `dataSource` property of Field. `DataSource` is the class describing the data that can be extracted from files to populate a `RecordSet`. This class should be used when the data coming from the source needs to be transformed or formatted to be included in the ML dataset; otherwise a simple `Reference` can be used instead to point to the source.
-`DataSource` is a subclassOf: [sc:Intangible](https://schema.org/Intangible) and defines the following properties:
+`DataSource` is a subclassOf: [sc:Intangible](http://schema.org/Intangible) and defines the following properties:
@@ -1190,31 +1190,31 @@ Commonly used atomic data types:
Usage
- sc:Boolean
+ sc:Boolean
Describes a boolean.
- sc:Date
+ sc:Date
Describes a date.
- sc:Time
+ sc:Time
Describes a time.
- sc:DateTime
+ sc:DateTime
Describes a combination of date and time of day.
- sc:Float
+ sc:Float
Describes a float.
- sc:Integer
+ sc:Integer
Describes an integer.
- sc:Text
+ sc:Text
Describes a string.
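The `sc:` names in the hunk above are compact IRIs: a JSON-LD processor expands them against the `sc` prefix declared in `@context`, and the resulting IRIs are compared as plain strings. The following is a minimal sketch of that expansion, assuming a context that maps `sc` to `http://schema.org/` as in the `@context` touched by this diff. The `expand` helper is hypothetical and greatly simplified; it is not part of any Croissant or JSON-LD library.

```python
# Hypothetical helper (not a real library API): a simplified version of
# JSON-LD compact-IRI expansion, for illustration only.
def expand(term: str, context: dict) -> str:
    """Expand 'prefix:suffix' using a prefix map; return other terms unchanged."""
    prefix, sep, suffix = term.partition(":")
    # Absolute IRIs such as "http://..." have a suffix starting with "//"
    # and are left untouched.
    if sep and prefix in context and not suffix.startswith("//"):
        return context[prefix] + suffix
    return term


HTTP_CONTEXT = {"sc": "http://schema.org/"}
HTTPS_CONTEXT = {"sc": "https://schema.org/"}

http_iri = expand("sc:Boolean", HTTP_CONTEXT)
https_iri = expand("sc:Boolean", HTTPS_CONTEXT)

print(http_iri)               # http://schema.org/Boolean
print(https_iri)              # https://schema.org/Boolean
print(http_iri == https_iri)  # False: the two schemes yield different IRIs
```

Because the two expansions are different strings, a document that mixes `http://schema.org/` and `https://schema.org/` would refer to two distinct sets of terms, so using one scheme consistently matters.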
@@ -1227,7 +1227,7 @@ Other data types commonly used in ML datasets:
Usage
- sc:ImageObject
+ sc:ImageObject
Describes a field containing the content of an image (pixels).
@@ -1251,10 +1251,10 @@ Croissant datasets can use data types from other vocabularies, such as Wikidata.
- wd:Q48277
+ wd:Q48277
(gender)
- Describes a Field or a RecordSet whose values are indicative of someone’s gender. This could be used for instance by RAI frameworks and tools to flag possible biases in the data. Values for this RecordSet can be associated with specific gender URLs (eg: wd:Q6581097, wd:Q6581072, etc.). Refer to the "Typed RecordSets > Enumerations" section for an example.
+ Describes a Field or a RecordSet whose values are indicative of someone’s gender. This could be used for instance by RAI frameworks and tools to flag possible biases in the data. Values for this RecordSet can be associated with specific gender URLs (eg: wd:Q6581097, wd:Q6581072, etc.). Refer to the "Typed RecordSets > Enumerations" section for an example.
@@ -1268,13 +1268,13 @@ In the following example, `color_sample` is a field containing an image, but wit
}
```
-In the following example, the `url` field is expected to be a URL, whose semantic type is [City](https://www.wikidata.org/wiki/Q515), so one will expect values of this field to be URLs referring to cities (e.g.: "").
+In the following example, the `url` field is expected to be a URL, whose semantic type is [City](http://www.wikidata.org/wiki/Q515), so one will expect values of this field to be URLs referring to cities (e.g.: "").
```json
{
"@id": "cities/url",
"@type": "cr:Field",
- "dataType": ["https://schema.org/URL", "https://www.wikidata.org/wiki/Q515"]
+ "dataType": ["http://schema.org/URL", "http://www.wikidata.org/wiki/Q515"]
}
```
@@ -1522,7 +1522,7 @@ Annotations can also appear at the level of a RecordSet. A RecordSet-level annot
],
"annotation" : {
"@type": "cr:Field", "@id": "movies/ratings",
- subField: [
+ "subField": [
{ "@type": "cr:Field", "@id": "movies/ratings/user_id", ...},
{ "@type": "cr:Field", "@id": "movies/ratings/rating", ...},
]
@@ -1577,9 +1577,9 @@ We now introduce a number of features that are useful in the context of ML data.
### Categorical Data
-In machine learning applications, it's often useful to know that some of the data is categorical in nature, and has a finite set of values that can be used, say, for classification. Croissant represents that information by using the [sc:Enumeration](https://schema.org/Enumeration) class from [schema.org](https://schema.org), as a `dataType` on `RecordSet`s that hold categorical data.
+In machine learning applications, it's often useful to know that some of the data is categorical in nature, and has a finite set of values that can be used, say, for classification. Croissant represents that information by using the [sc:Enumeration](http://schema.org/Enumeration) class from [schema.org](http://schema.org), as a `dataType` on `RecordSet`s that hold categorical data.
-These RecordSets must define a `name` field conforming with the [sc:name](https://schema.org/name) definition, i.e. a human-readable text naming the item. They must also specify a key to identify each possible instance. Enumerations should have a `url` field, which can also be used to uniquely refer to each instance.
+These RecordSets must define a `name` field conforming with the [sc:name](http://schema.org/name) definition, i.e. a human-readable text naming the item. They must also specify a key to identify each possible instance. Enumerations should have a `url` field, which can also be used to uniquely refer to each instance.
For example, the [COCO](https://cocodataset.org/#format-data) dataset defines categories and super-categories ([Croissant definition](https://github.com/mlcommons/croissant/blob/main/datasets/1.0/coco2014/metadata.json)), to which are associated other parts of the dataset.
Using Croissant, one can describe the COCO super-categories the following way:
@@ -1840,8 +1840,8 @@ Segmentation mask as an image:
```json
"@context": {
"@language": "en",
- "@vocab": "https://schema.org/",
- "sc": "https://schema.org/",
+ "@vocab": "http://schema.org/",
+ "sc": "http://schema.org/",
"cr": "http://mlcommons.org/croissant/",
"rai": "http://mlcommons.org/croissant/RAI/",
"dct": "http://purl.org/dc/terms/",