
Create annotator tasks for a HEP collection #673

@kermitt2


We would like to create annotator tasks from a new collection of Open Access papers in the High Energy Physics (HEP) field. HEP is interesting because it is almost entirely available in OA (publications are funded by the SCOAP³ initiative), it is complementary to the fields we already cover, and it relies significantly on scientific software.

The foreseen process is as follows:

  • use the arXiv OAI-PMH interface to get metadata for the preprints in the HEP sets (e.g. hep-ex, hep-th)
  • from this preprint metadata, we get the DOIs of the published versions (which are usually gold OA). Note that for the softcite dataset, we focus on published articles rather than preprints. We already have scripts to do that (https://github.com/kermitt2/article-dataset-builder), relying in particular on the CrossRef REST API.
  • thanks to Unpaywall, we get the PDF URLs
  • we harvest those PDFs (still with https://github.com/kermitt2/article-dataset-builder)
  • we process the PDFs with the existing scripts under code/corpus/ (and Grobid) to get a JSON representation with software mentions
  • we create annotator tasks in CSV from the JSON
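
The first two steps above could be sketched roughly as follows. This is only an illustration, not the code of article-dataset-builder: the endpoint URL, the set name `physics:hep-ex`, the `arXiv` metadata prefix and its namespace are assumptions to check against the current arXiv OAI-PMH documentation, and the sample record is made up. Preprints without a `doi` field would still need a lookup through the CrossRef REST API, which is not shown here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint and namespaces; verify against the arXiv OAI-PMH docs.
OAI_BASE = "http://export.arxiv.org/oai2"
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "arxiv": "http://arxiv.org/OAI/arXiv/",
}


def list_records_url(set_spec, base=OAI_BASE):
    """Build a ListRecords request URL for one arXiv set."""
    params = {"verb": "ListRecords", "set": set_spec, "metadataPrefix": "arXiv"}
    return base + "?" + urlencode(params)


def extract_dois(xml_text):
    """Return (arxiv_id, doi) pairs for the records that declare a DOI."""
    root = ET.fromstring(xml_text)
    pairs = []
    for rec in root.iterfind(".//arxiv:arXiv", NS):
        arxiv_id = rec.findtext("arxiv:id", default="", namespaces=NS).strip()
        doi = rec.findtext("arxiv:doi", default="", namespaces=NS).strip()
        if doi:  # skip preprints with no published version declared
            pairs.append((arxiv_id, doi))
    return pairs


# Made-up response fragment, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
        <id>0000.00001</id><doi>10.0000/example.1</doi>
      </arXiv>
    </metadata></record>
    <record><metadata>
      <arXiv xmlns="http://arxiv.org/OAI/arXiv/">
        <id>0000.00002</id>
      </arXiv>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("physics:hep-ex"))  # set name is an assumption
    print(extract_dois(SAMPLE))
```

A real harvest would also have to follow the OAI-PMH `resumptionToken` to page through large sets, which this sketch leaves out.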

All the tools for this pipeline should already be available.
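
For reference, the Unpaywall step might look like the sketch below. It is a hedged illustration rather than what article-dataset-builder actually does: it assumes the Unpaywall v2 REST endpoint and the `best_oa_location`, `oa_locations` and `url_for_pdf` fields of its JSON response, and the helper names `unpaywall_url` and `pick_pdf_url` are hypothetical. The network call itself is omitted so that the selection logic can be read on its own.

```python
from urllib.parse import quote

# Assumed Unpaywall v2 endpoint; an email address is required as a parameter.
UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI (hypothetical helper)."""
    return f"{UNPAYWALL_BASE}{quote(doi)}?email={quote(email)}"


def pick_pdf_url(response):
    """Pick a PDF URL from a decoded Unpaywall JSON response, or None.

    Prefers the best OA location, then falls back to any other location
    that exposes a direct PDF link.
    """
    best = response.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    for location in response.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    return None


if __name__ == "__main__":
    # Made-up response, for illustration only.
    fake = {
        "best_oa_location": {"url_for_pdf": None},
        "oa_locations": [{"url_for_pdf": "https://example.org/paper.pdf"}],
    }
    print(unpaywall_url("10.0000/example.1", "someone@example.org"))
    print(pick_pdf_url(fake))
```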
