Training Data #10

@ksuderman

Most of the low-hanging fruit has been harvested from the Zotero spreadsheet and can be found in /data/corpura/curation/gold on the Lappsgrid server. These files are (or at least should be) all true positives. For training purposes we also need negative examples.

Possible sources of negative examples

  1. Zotero spreadsheet
    • query crossref.org to find the DOI record from the title and authors
    • check whether the DOI record has a download link for TDM (text and data mining)
  2. Query crossref.org
    • find the DOIs of all articles that contain the string "galaxy"
    • remove articles that appear in the Zotero spreadsheet
    • How do we determine whether the remaining articles are true negatives or just positives we don't know about yet?
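
A rough sketch of how the two lookups above might be scripted against the public Crossref REST API (`https://api.crossref.org/works`). The helper names (`title_author_url`, `keyword_url`, `tdm_links`, `unknown_candidates`) are hypothetical, and the code assumes that TDM download links show up in a record's `link` list with `intended-application` set to `text-mining`. Note that Crossref's `query` parameter searches metadata, not article full text, so it only approximates "contains the string galaxy".

```python
import json
import urllib.parse
import urllib.request

API = "https://api.crossref.org/works"


def works_url(**params):
    """Build a Crossref /works URL from query parameters."""
    return API + "?" + urllib.parse.urlencode(params)


def title_author_url(title, authors, rows=5):
    """Source 1: look up a spreadsheet entry by title and authors."""
    return works_url(**{
        "query.bibliographic": title,
        "query.author": " ".join(authors),
        "rows": rows,
    })


def keyword_url(keyword, rows=100):
    """Source 2: metadata search for a keyword such as "galaxy"."""
    return works_url(query=keyword, rows=rows)


def fetch_items(url):
    """Fetch one page of results; returns the list of work records."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["message"]["items"]


def tdm_links(record):
    """Return the URLs a record advertises for text mining (assumed field layout)."""
    return [
        link["URL"]
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


def unknown_candidates(records, known_dois):
    """Drop records whose DOI already appears in the Zotero spreadsheet."""
    known = {doi.lower() for doi in known_dois}
    return [r for r in records if r.get("DOI", "").lower() not in known]
```

Calling `fetch_items(keyword_url("galaxy"))` and passing the result through `unknown_candidates` with the spreadsheet's DOIs would give the candidate pool for source 2. It does not answer the open question of whether those candidates are true negatives.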

@nancyide please provide feedback on the size of the training set we should aim for and the ratio of positives to negatives.

Labels: task (Something that needs doing.)
