Commit 1a79219

Merge pull request #42 from IQSS/custom-size-image-dataset
Custom size image dataset
2 parents 23e78da + 28ee998 commit 1a79219

8 files changed

Lines changed: 188 additions & 1 deletion

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -3,3 +3,6 @@ __pycache__
 dvconfig.py
 ec2-create-instance.sh
 venv
+dv_logo_*.png
+*.DS_Store
+sample.sh

README.md

Lines changed: 18 additions & 1 deletion
@@ -26,7 +26,7 @@ Activate the virtual environment you just created.
 
 Install dependencies into the virtual environment, especially [pyDataverse][].
 
-    pip install -r requirements.txt
+    pip3 install -r requirements.txt
 
 Copy `dvconfig.py.sample` to `dvconfig.py` (see the `cp` command below) and add your API token (using your favorite text editor, which may not be `vi` as shown below). Note that the config file specifies which sample data will be created.
 
@@ -35,12 +35,29 @@ Copy `dvconfig.py.sample` to `dvconfig.py` (see the `cp` command below) and add
 
 Note that the environment variable `$API_TOKEN` will override `api_token` in `dvconfig.py`.
 
+## Adding a custom dataset with specific number of files
+
+You can add a specific number of files to the dataset "Dataverse performance test dataset" with:
+
+    python create_sample_custom_dataset.py
+
+You'll be prompted to specify the number of files you wish to create. The application will then generate the requested number of files, each one with the Dataverse logo in a randomly chosen color. These files will be in PNG format. It's important to complete this step before adding any data, as the dataset will otherwise be empty.
+
+If you experience the `OSError: no library called "cairo-2" was found` error please declare the following env variable as documented [here](https://github.com/Kozea/CairoSVG/issues/392#issuecomment-1927435606
+):
+
+    export DYLD_LIBRARY_PATH="/opt/homebrew/opt/cairo/lib:$DYLD_LIBRARY_PATH"
+
 ## Adding sample data
 
 Assuming you have already run the `source` and `cd` commands above, you should be able to run the following command to create sample data.
 
     python create_sample_data.py
 
+https://github.com/Kozea/CairoSVG/issues/392#issuecomment-1927435606
+
+    export DYLD_LIBRARY_PATH="/opt/homebrew/opt/cairo/lib:$DYLD_LIBRARY_PATH"
+
 All of the steps above may be automated in a fresh installation of Dataverse on an EC2 instance on AWS by downloading [ec2-create-instance.sh][] and [main.yaml][]. Edit main.yml to set `dataverse.sampledata.enabled: true` and adjust any other settings to your liking, then execute the script with the config file like this:
 
     curl -O https://raw.githubusercontent.com/GlobalDataverseCommunityConsortium/dataverse-ansible/master/ec2/ec2-create-instance.sh
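
The README hunk above works around `OSError: no library called "cairo-2" was found` by exporting `DYLD_LIBRARY_PATH`. As a rough sketch, a script could check up front whether a Cairo shared library is discoverable and print the hint itself. The helper name `cairo_library_path` and the list of candidate names are assumptions made for illustration; they are not part of this commit, and CairoSVG's own lookup may behave differently.

```python
import ctypes.util


def cairo_library_path(candidates=("cairo", "cairo-2")):
    """Return the first Cairo library the system loader can locate, else None.

    Hypothetical helper: the candidate names are a guess based on the error
    message quoted in the README, not an exhaustive list.
    """
    for name in candidates:
        path = ctypes.util.find_library(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = cairo_library_path()
    if found is None:
        # Same hint the README gives for Homebrew installs on macOS.
        print('Cairo not found; try: export DYLD_LIBRARY_PATH='
              '"/opt/homebrew/opt/cairo/lib:$DYLD_LIBRARY_PATH"')
    else:
        print(f"Cairo library found at {found}")
```

A `None` result here does not prove that CairoSVG will fail; it only means the standard lookup did not find the library under the names tried.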

create_sample_custom_dataset.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+import random
+import re
+import cairosvg
+
+#from CairoSVG import svg2png
+
+generated_files = input('Number of files to generate: ')
+target_path = './data/dataverses/dataverse-performance-demo/datasets/performance-test/files'
+
+with open('dv_logo_hd.svg', 'r') as file:
+    svg_code = file.read()
+
+for iteration in range(int(generated_files)):
+    random_color = '#' + ''.join(random.choices('0123456789ABCDEF', k=6))
+    svg_code_tmp = re.sub(r'#c65b28', random_color, svg_code)
+    destination_path = (
+        f"{target_path}/dv_logo_{str(iteration).zfill(5)}.png"
+    )
+    cairosvg.svg2png(bytestring=svg_code_tmp, write_to=destination_path)
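
The script above assumes that the `files` directory already exists and that the prompt receives a valid integer. A possible follow-up, sketched below, splits the same logic into small functions so that input validation and directory creation are explicit. All function names (`random_hex_color`, `recolor`, `output_path`, `parse_count`, `generate`) are hypothetical and not part of this commit; `cairosvg` is imported lazily so the helpers can be used without Cairo installed.

```python
import random
import re
from pathlib import Path

ORIGINAL_COLOR = "#c65b28"  # the colour the committed script replaces


def random_hex_color(rng=random):
    """Return a random colour such as '#3FA9C0'."""
    return "#" + "".join(rng.choices("0123456789ABCDEF", k=6))


def recolor(svg_code, color):
    """Swap the logo's original colour for `color` everywhere in the SVG."""
    return re.sub(re.escape(ORIGINAL_COLOR), color, svg_code)


def output_path(target_dir, index):
    """Build the zero-padded PNG path, e.g. dv_logo_00007.png."""
    return Path(target_dir) / f"dv_logo_{index:05d}.png"


def parse_count(raw):
    """Convert the prompt input to a non-negative int or raise ValueError."""
    count = int(raw)
    if count < 0:
        raise ValueError("number of files must not be negative")
    return count


def generate(svg_code, count, target_dir):
    """Render `count` recoloured logos into `target_dir` (assumes CairoSVG)."""
    import cairosvg  # lazy import: only needed when actually rendering

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for index in range(count):
        svg = recolor(svg_code, random_hex_color())
        cairosvg.svg2png(
            bytestring=svg.encode("utf-8"),
            write_to=str(output_path(target_dir, index)),
        )
```

The behaviour matches the committed script for valid input; the differences are that a missing target directory is created and a negative count is rejected.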
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+{
+  "datasetVersion": {
+    "id": 4,
+    "datasetId": 12,
+    "datasetPersistentId": "doi:10.5072/FK2/JPT050",
+    "storageIdentifier": "file://10.5072/FK2/JPT050",
+    "versionNumber": 1,
+    "versionMinorNumber": 0,
+    "versionState": "RELEASED",
+    "UNF": "UNF:6:VDyWtJrNd0VRwAumtzYA1Q==",
+    "lastUpdateTime": "2021-09-20T18:38:32Z",
+    "releaseTime": "2021-09-20T18:38:32Z",
+    "createTime": "2021-09-20T18:16:38Z",
+    "license": "CC0 1.0",
+    "termsOfUse": "CC0 Waiver",
+    "fileAccessRequest": false,
+    "metadataBlocks": {
+      "citation": {
+        "displayName": "Citation Metadata",
+        "fields": [
+          {
+            "typeName": "title",
+            "multiple": false,
+            "typeClass": "primitive",
+            "value": "Dataverse performance test dataset"
+          },
+          {
+            "typeName": "author",
+            "multiple": true,
+            "typeClass": "compound",
+            "value": [
+              {
+                "authorName": {
+                  "typeName": "authorName",
+                  "multiple": false,
+                  "typeClass": "primitive",
+                  "value": "Juan Pablo Tosca Villanueva"
+                }
+              }
+            ]
+          },
+          {
+            "typeName": "datasetContact",
+            "multiple": true,
+            "typeClass": "compound",
+            "value": [
+              {
+                "datasetContactName": {
+                  "typeName": "datasetContactName",
+                  "multiple": false,
+                  "typeClass": "primitive",
+                  "value": "Juan Pablo Tosca Villanueva"
+                },
+                "datasetContactEmail": {
+                  "typeName": "datasetContactEmail",
+                  "multiple": false,
+                  "typeClass": "primitive",
+                  "value": "dataverse@mailinator.com"
+                }
+              }
+            ]
+          },
+          {
+            "typeName": "dsDescription",
+            "multiple": true,
+            "typeClass": "compound",
+            "value": [
+              {
+                "dsDescriptionValue": {
+                  "typeName": "dsDescriptionValue",
+                  "multiple": false,
+                  "typeClass": "primitive",
+                  "value": "This is a test dataset to measure the performance of the Dataverse software."
+                }
+              }
+            ]
+          },
+          {
+            "typeName": "subject",
+            "multiple": true,
+            "typeClass": "controlledVocabulary",
+            "value": [
+              "Social Sciences"
+            ]
+          },
+          {
+            "typeName": "depositor",
+            "multiple": false,
+            "typeClass": "primitive",
+            "value": "Admin, Dataverse"
+          },
+          {
+            "typeName": "dateOfDeposit",
+            "multiple": false,
+            "typeClass": "primitive",
+            "value": "2024-05-05"
+          }
+        ]
+      }
+    },
+    "citation": "IQSS, 2024, \"Dataverse performance test\", https://doi.org/10.5072/FK2/JPT050, Root, V1"
+  }
+}
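
The dataset JSON above nests its metadata under `datasetVersion.metadataBlocks.citation.fields`, where each field carries a `typeName` and a `value` that is either a primitive or a list of compound entries. A minimal sketch of reading values back out of that structure follows; the helper names `citation_field` and `author_names` are hypothetical, and the sample dict is a trimmed copy of the file above, used only for illustration.

```python
def citation_field(dataset, type_name):
    """Return the value of the citation field named `type_name`, or None."""
    fields = dataset["datasetVersion"]["metadataBlocks"]["citation"]["fields"]
    for field in fields:
        if field["typeName"] == type_name:
            return field["value"]
    return None


def author_names(dataset):
    """Collect the authorName values from the compound 'author' field."""
    authors = citation_field(dataset, "author") or []
    return [entry["authorName"]["value"] for entry in authors]


# Trimmed copy of the JSON added in this commit (most keys omitted).
sample = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "fields": [
                    {
                        "typeName": "title",
                        "multiple": False,
                        "typeClass": "primitive",
                        "value": "Dataverse performance test dataset",
                    },
                    {
                        "typeName": "author",
                        "multiple": True,
                        "typeClass": "compound",
                        "value": [
                            {
                                "authorName": {
                                    "typeName": "authorName",
                                    "multiple": False,
                                    "typeClass": "primitive",
                                    "value": "Juan Pablo Tosca Villanueva",
                                }
                            }
                        ],
                    },
                ]
            }
        }
    }
}

print(citation_field(sample, "title"))
print(author_names(sample))
```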
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+{
+  "name": "Dataverse performance demo",
+  "alias": "dataverse-performance-demo",
+  "dataverseContacts": [
+    {
+      "contactEmail": "juan_tosca@iq.harvard.edu"
+    }
+  ],
+  "affiliation": "Harvard University",
+  "description": "Demo created for performance testing",
+  "dataverseType": "RESEARCH_PROJECTS"
+}

dv_logo_hd.svg

Lines changed: 30 additions & 0 deletions

dvconfig.py.sample

Lines changed: 2 additions & 0 deletions
@@ -25,6 +25,8 @@ sample_data = [
     'data/dataverses/open-source-at-harvard/datasets/open-source-at-harvard/open-source-at-harvard.json',
     'data/dataverses/king/king.json',
     'data/dataverses/king/datasets/cause-of-death/cause-of-death.json',
+    'data/dataverses/dataverse-performance-demo/dataverse-performance-demo.json',
+    'data/dataverses/dataverse-performance-demo/datasets/performance-test/performance-test.json',
 ]
 
 # put this back at line 6 once https://github.com/IQSS/dataverse/pull/6924 is merged
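
Because `sample_data` is a plain list of relative paths, a typo in one of the two new entries would only surface when the loader runs. One way to catch that earlier is a small check like the sketch below. The function name `missing_sample_files` is hypothetical, and it assumes the paths are relative to the repository root, which the diff suggests but does not state.

```python
from pathlib import Path


def missing_sample_files(paths, base_dir="."):
    """Return the entries of `paths` that are not existing files under base_dir."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


if __name__ == "__main__":
    # The two entries added in this commit.
    new_entries = [
        "data/dataverses/dataverse-performance-demo/dataverse-performance-demo.json",
        "data/dataverses/dataverse-performance-demo/datasets/performance-test/performance-test.json",
    ]
    for path in missing_sample_files(new_entries):
        print(f"missing: {path}")
```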

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
 pyDataverse==0.2.1
+CairoSVG==2.7.1
