We would like to create annotator tasks from a new collection of Open Access papers in the High Energy Physics (HEP) field. HEP is interesting because it is almost entirely available in OA (publications are funded by the SCOAP³ initiative), it is complementary to the fields we already cover, and it relies significantly on scientific software.
The foreseen process is as follows:
- use the arXiv OAI-PMH interface to get metadata for the preprint articles in the HEP sets (e.g. hep-ex, hep-th, etc.)
- from this preprint metadata, we get the DOIs of the published versions (which would usually be gold OA). Note that for the softcite dataset, we focus on published articles rather than preprints. We already have scripts to do that (https://github.com/kermitt2/article-dataset-builder), relying on the CrossRef REST API in particular.
- thanks to Unpaywall, we get the PDF URLs
- we harvest those PDFs (still with https://github.com/kermitt2/article-dataset-builder)
- we process the PDFs with the existing scripts under code/corpus/ (and Grobid) to get a JSON representation with software mentions
- we create annotator tasks in CSV format from the JSON
All the tools for this pipeline should already be available.
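
As a rough illustration of the first three steps (OAI-PMH metadata, DOI extraction, Unpaywall lookup), here is a minimal sketch using only the Python standard library. It is not the code of article-dataset-builder. The function names are hypothetical, and the endpoint URL and the `physics:hep-ex` set name are assumptions to check against the arXiv OAI-PMH documentation. It assumes the `arXiv` metadata format, where each record carries an `id` and, when known, a `doi` element.

```python
import json
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: endpoint and set names must be verified against arXiv's OAI-PMH docs.
ARXIV_OAI_BASE = "https://export.arxiv.org/oai2"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"


def list_records_url(base_url, set_spec, metadata_prefix="arXiv"):
    """Build an OAI-PMH ListRecords request URL for one set (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "set": set_spec, "metadataPrefix": metadata_prefix}
    )
    return f"{base_url}?{query}"


def extract_dois(oai_xml):
    """Return (arXiv id, DOI) pairs for the records that declare a published DOI."""
    root = ET.fromstring(oai_xml)
    pairs = []
    for record in root.iter(OAI_NS + "record"):
        arxiv_id = record.findtext(f".//{ARXIV_NS}id")
        doi = record.findtext(f".//{ARXIV_NS}doi")
        if arxiv_id and doi:  # preprints without a published version are skipped
            pairs.append((arxiv_id.strip(), doi.strip()))
    return pairs


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for a DOI (an email parameter is required)."""
    return (
        "https://api.unpaywall.org/v2/"
        f"{urllib.parse.quote(doi)}?email={urllib.parse.quote(email)}"
    )


def pdf_url_from_unpaywall(response_json):
    """Pick the PDF URL of the best OA location, or None if there is none."""
    location = json.loads(response_json).get("best_oa_location") or {}
    return location.get("url_for_pdf")


# Example: the request URL for one of the HEP sets (set name is an assumption).
print(list_records_url(ARXIV_OAI_BASE, "physics:hep-ex"))
```

The sketch deliberately leaves out the HTTP calls and the paging of large sets through OAI-PMH `resumptionToken` values. In practice the response bodies would be passed to `extract_dois` and `pdf_url_from_unpaywall`, and the resulting PDF URLs handed to the harvesting step.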