Commit 1687207

Merge pull request #2 from gdcc/1-field
stop repeating field over and over #1
2 parents dfed3b0 + 52bcb37 commit 1687207

3 files changed

Lines changed: 15 additions & 69 deletions

README.md

Lines changed: 0 additions & 1 deletion
@@ -152,7 +152,6 @@ Same as above but use a JVM option in domain.xml such as the example below.
 ### Differences from Kaggle
 
 - I see an `encodingFormat` of `text/comma-separated-values`. Kind of curious about that since I think `text/csv` is more the MIME type that's on https://www.iana.org/assignments/media-types/media-types.xhtml and https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types . See https://github.com/IQSS/dataverse/issues/4943#issuecomment-2145333830
-- One big difference I see is that you have many `recordSets` (and each one containing a single `field`) despite there being only 1 CSV. My understanding was that a `recordSet` maps roughly to a table and a `field` maps roughly to a column. So you'll see that our implementation has only 1 `recordSet` with many `field`s. This might be a good thing to get clarification on.
 - Another thing that sticks out is that I see all of the `field`s have a `dataType` of `sc:Integer`. But nearly all of the columns (excluding `quality` and `Id`) are `sc:Float`. On the Kaggle side, we have a column type of "Id" and so if that's set on a column, we set the `dataType` to `sc:Text` since Ids can often be non-numerical. Just a minor difference there, though, so nothing alarming to me personally.
 
 ### Differences from pyDataverse

src/main/java/io/gdcc/spi/export/croissant/CroissantExporter.java

Lines changed: 4 additions & 2 deletions
@@ -193,6 +193,8 @@ public void exportDataset(ExportDataProvider dataProvider, OutputStream outputSt
 int fileCounter = 0;
 for (JsonValue jsonValue : datasetFileDetails) {
 
+JsonObjectBuilder recordSetContent = Json.createObjectBuilder();
+recordSetContent.add("@type", "cr:RecordSet");
 JsonObject fileDetails = jsonValue.asJsonObject();
 /**
 * When there is an originalFileName, it means that the file has gone through ingest
@@ -306,9 +308,9 @@ public void exportDataset(ExportDataProvider dataProvider, OutputStream outputSt
 "fileObject",
 Json.createObjectBuilder()
 .add("@id", fileId))));
-fieldSetObject.add("field", fieldSetArray);
-recordSet.add(fieldSetObject);
 }
+recordSetContent.add("field", fieldSetArray);
+recordSet.add(recordSetContent);
 fileIndex++;
 }
 fileCounter++;
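
The restructuring in the hunks above can be pictured with a minimal, standalone Java sketch. It is not the exporter's code: plain `Map` and `List` stand in for the JSON builders, and the class name `RecordSetSketch`, the method `buildRecordSet`, and the `source` key are assumptions made for illustration. The point is only the ordering: the record set is created once before the loop over columns, and the `field` array is attached once after the loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain collections stand in for the exporter's JSON builders.
final class RecordSetSketch {

    /** Builds one cr:RecordSet that holds a cr:Field for every column name. */
    static Map<String, Object> buildRecordSet(String fileId, List<String> columns) {
        // Create the record set once, before looping over the columns.
        Map<String, Object> recordSet = new LinkedHashMap<>();
        recordSet.put("@type", "cr:RecordSet");

        List<Map<String, Object>> fields = new ArrayList<>();
        for (String column : columns) {
            Map<String, Object> field = new LinkedHashMap<>();
            field.put("@type", "cr:Field");
            field.put("name", column);
            // Assumed key name: each field points back at the file it is read from.
            field.put("source", Map.of("fileObject", Map.of("@id", fileId)));
            fields.add(field);
        }

        // Attach the accumulated fields once, after the loop has finished.
        recordSet.put("field", fields);
        return recordSet;
    }

    public static void main(String[] args) {
        Map<String, Object> recordSet =
                buildRecordSet("data/stata13-auto.dta", List.of("price", "mpg", "rep78"));
        System.out.println(recordSet);
    }
}
```

Moving the two `add` calls out of the inner loop is what collapses the many single-field record sets in the old fixture into one record set with many fields.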

src/test/resources/cars/expected/cars-croissant.json

Lines changed: 11 additions & 66 deletions
@@ -126,12 +126,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "price",
@@ -143,12 +138,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "mpg",
@@ -160,12 +150,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "rep78",
@@ -177,12 +162,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "headroom",
@@ -194,12 +174,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "trunk",
@@ -211,12 +186,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "weight",
@@ -228,12 +198,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "length",
@@ -245,12 +210,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "turn",
@@ -262,12 +222,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "displacement",
@@ -279,12 +234,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "gear_ratio",
@@ -296,12 +246,7 @@
 "@id": "data/stata13-auto.dta"
 }
 }
-}
-]
-},
-{
-"@type": "cr:RecordSet",
-"field": [
+},
 {
 "@type": "cr:Field",
 "name": "foreign",
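
Each hunk in this fixture removes one repeated `cr:RecordSet` wrapper, leaving the fields side by side in a single `field` array. A rough way to guard against the repetition coming back is to count `cr:RecordSet` entries in exported JSON. The sketch below does this with a regular expression rather than a JSON parser, which is a simplification; the class name `RecordSetCounter` and the sample string are hypothetical and not part of the repository.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts cr:RecordSet declarations in a Croissant JSON string.
final class RecordSetCounter {

    // Matches "@type": "cr:RecordSet" with any whitespace around the colon.
    private static final Pattern RECORD_SET =
            Pattern.compile("\"@type\"\\s*:\\s*\"cr:RecordSet\"");

    static int countRecordSets(String json) {
        Matcher matcher = RECORD_SET.matcher(json);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample: one record set holding two fields.
        String sample = "{\"recordSet\":[{\"@type\":\"cr:RecordSet\",\"field\":["
                + "{\"@type\":\"cr:Field\",\"name\":\"price\"},"
                + "{\"@type\":\"cr:Field\",\"name\":\"mpg\"}]}]}";
        System.out.println(countRecordSets(sample));
    }
}
```

A text match like this would also count the string if it appeared inside a description, so a real test would more likely parse the JSON and inspect the `recordSet` array directly.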
