
CORD 19 Semantic Annotation Projects

David Booth edited this page Apr 17, 2020 · 10 revisions

This page lists all known projects that are doing semantic annotation of the CORD-19 dataset. If you know of a project that is not listed here, please add it AND please contact David Booth, who chairs a weekly teleconference (11am Boston time) to coordinate and learn about each other's efforts.

Discussion of this work is on the public W3C Healthcare and Life Sciences mailing list, and teleconference minutes are posted there.

Project name: CORD-19-on-FHIR

Contact name and email: David Booth david@dbooth.org, Jiang, Guoqian, M.D., Ph.D. Jiang.Guoqian@mayo.edu, Harold Solbrig solbrig@jhu.edu

Description: We are currently using NLP to extract Conditions, Medications and Procedures from each article's title and abstract. We plan to extend this to the article full text where available. We are also using PubTator to extract Species, Gene, Disease, Chemical, CellLine, Mutation and Strain. The result is represented in FHIR RDF.

Data format(s): FHIR RDF

Data license: Our annotations are CC0 licensed.

Website or github URL: https://github.com/fhircat/CORD-19-on-FHIR

Slides: https://lists.w3.org/Archives/Public/www-archive/2020Apr/att-0002/CORD-19-on-FHIR-2020-0-31.pdf

Project name: CORD-ReDrugS

Contact name and email: Jim McCusker mccusj2@rpi.edu

Description: Enhance ReDrugS [1] to use extracted entities and relations from [2] to repurpose potential therapies.

[1] McCusker JP, Dumontier M, Yan R, He S, Dordick JS, McGuinness DL. 2017. Finding melanoma drugs through a probabilistic knowledge graph. PeerJ Computer Science 3:e106 doi:10.7717/peerj-cs.106

[2] kaggle.com/yitongtseo/cord19-named-entities

Data format(s): RDF, SIO + PROV

Data license: TBD

Website or github URL:

Slides:

Project name: CORD-SEMANTICTRIPLES

Contact name and email: Scott Malec (scott.malec@gmail.com | sam413@pitt.edu)

Description: Computable knowledge extracted from the literature using machine reading can help researchers understand and leverage the unprecedented volume of information gathered about the novel coronavirus. We hypothesize that machine interpretation techniques can be used to build graphical models of related concepts, with highly connected nodes suggesting biologically plausible actors. We introduce a new resource, derived from the Semantic MEDLINE database (SemMedDB), restricted to documents that also appear in the COVID-19 corpus. SemMedDB contains concept-relation-concept semantic triples, or predications. After extracting ~106K semantic predications, we imported them into a network and applied network centrality metrics (degree, closeness, betweenness) to identify factors associated with COVID-19 and assess their biological plausibility. Filtering the nodes by semantic type to search for drugs, drug targets, biomarkers, or comorbidities associated with complications, we were able to recapitulate agents already in randomized controlled trials for preventing or treating COVID-19 infections, as well as comorbidities associated with lethal complications, many of which made sense upon further inspection. This guilt-by-association analysis demonstrates the value of the information that machine reading software exposes as computable knowledge.
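The centrality step described above can be illustrated with a minimal sketch. It is not the project's pipeline: the triples are made-up placeholders rather than SemMedDB predications, the helper names are hypothetical, and only degree and closeness are computed (betweenness is omitted for brevity).

```python
from collections import defaultdict, deque

# Toy predications (subject, predicate, object). These are hypothetical
# placeholders, not data taken from SemMedDB.
TRIPLES = [
    ("drug_A", "TREATS", "disease_X"),
    ("drug_A", "INHIBITS", "protein_P"),
    ("protein_P", "ASSOCIATED_WITH", "disease_X"),
    ("drug_B", "INTERACTS_WITH", "protein_P"),
]


def build_graph(triples):
    """Build an undirected adjacency map, ignoring the predicate label."""
    graph = defaultdict(set)
    for subj, _pred, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    denom = max(len(graph) - 1, 1)
    return {node: len(nbrs) / denom for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """Number of reachable nodes divided by the sum of BFS distances to them."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


graph = build_graph(TRIPLES)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)

# Rank nodes by degree (ties broken alphabetically) to surface hub concepts.
ranked = sorted(graph, key=lambda node: (-degree[node], node))
for node in ranked:
    print(f"{node}\tdegree={degree[node]:.2f}\tcloseness={closeness[node]:.2f}")
```

In this toy graph the most connected node is `protein_P`. On real predications, the nodes would additionally be filtered by semantic type before ranking, as the description outlines.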

Data format(s): RDF/XML, SQL, including Cytoscape-compatible formats (*.tsv, *.SIF)

Data license: TBD

Website or github URL: https://github.com/kingfish777/COVID19 (still a mess). See the .cys file for a Cytoscape-friendly version and the .xls spreadsheet for preliminary results. I will be uploading other formats, including the processing pipeline, and pointing to a more ambitious follow-up project that applies computable knowledge, derived from the COVID-19 corpus using various machine reading frameworks, to several practical use cases.

Draft paper: https://docs.google.com/document/d/1qQkLlvwOWOy1Rt7eUTKTCUfh-WodyXd8uv8UTnpCGoA/edit#

Slides:

Project name: ? (Tomáš Kliegr and Gollam Rabby)

Contact name and email: Tomáš Kliegr tomas.kliegr@vse.cz, Gollam Rabby rabby2186@gmail.com

Description: Tools for extracting associations from knowledge graphs and transaction data.

Presentation: https://docs.google.com/presentation/d/1eX9eTb0C8roy7pYK8li3V5YcWhFByN6is6hz_b62AcA/edit#slide=id.p

Data format(s): RDF, SQL Dumps, CSV

Data license: NA

Website or github URL:

  • JupyterLab notebook with existing code for entity extraction.
  • Demo of our web-based self-service EasyMiner tool for learning rules from single CSVs.
  • Demo of our web-based self-service RDFRules tool for learning rules from RDF KGs.

Slides:

Project name: COVID-KG

Contact name and email: Gilles Vandewiele gilles.vandewiele@ugent.be, Bram Steenwinckel bram.steenwickel@ugent.be

Description: Transform the JSON and CSV files into RDF to create a Knowledge Graph that contains at least the same information as the original dataset, plus additional knowledge, to facilitate analysis by other researchers.
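As a rough illustration of what a JSON-to-RDF transformation involves, the sketch below serializes one paper record as N-Triples using only the standard library. It is not the COVID-KG code: the `http://example.org/cord19/` namespace, the flat record layout (`paper_id`, `title`, `authors`), and the function names are assumptions made for this example. Dublin Core Terms is used as the property vocabulary.

```python
import json

BASE = "http://example.org/cord19/"   # hypothetical namespace for this sketch
DCT = "http://purl.org/dc/terms/"     # Dublin Core Terms vocabulary


def literal(text):
    """Serialize a string as an N-Triples literal with minimal escaping."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def paper_to_ntriples(paper):
    """Turn one paper record into a list of N-Triples lines.

    Assumes paper_id contains only characters that are safe inside an IRI.
    """
    subject = f"<{BASE}paper/{paper['paper_id']}>"
    lines = [f"{subject} <{DCT}title> {literal(paper['title'])} ."]
    for author in paper.get("authors", []):
        lines.append(f"{subject} <{DCT}creator> {literal(author)} .")
    return lines


# A made-up record; the field names are assumptions, not the CORD-19 schema.
record = json.loads(
    '{"paper_id": "abc123", "title": "Toy title", "authors": ["Doe J", "Roe R"]}'
)

for line in paper_to_ntriples(record):
    print(line)
```

N-Triples is chosen here only because it can be written line by line without a library. A real pipeline would more likely use an RDF toolkit or a mapping language.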

Data format(s): RDF

Data license: TBD, as open as possible.

Website or github URL: http://github.com/GillesVandewiele/COVID-KG/ & https://www.kaggle.com/group16/covid19-literature-knowledge-graph

Slides:

Project name: ? (Jin-Dong Kim)

Contact name and email: Jin-Dong Kim jindong.kim@gmail.com

Description:

Data format(s):

Data license:

Website or github URL:

Slides:

Project name: ? (James Malone)

Contact name and email: James Malone james@scibite.com

Description:

Data format(s):

Data license:

Website or github URL:

Slides:

Project name: CORD-19 Named Entities Knowledge Graph (CORD19-NEKG)

Contact name and email: Fabien Gandon fabien.gandon@inria.fr, Franck Michel fmichel@i3s.unice.fr

Description: CORD-19 Named Entities Knowledge Graph (CORD19-NEKG) is an RDF dataset describing named entities identified in the scholarly articles of the COVID-19 Open Research Dataset (CORD-19). CORD19-NEKG is an initiative of the Wimmics team. RDF files are generated using Morph-xR2RML, an implementation of the xR2RML mapping language.

Data format(s): RDF Turtle

Data license: CORD19 license for the part of the dataset which is just an RDF version of CORD19 metadata, Open Data Commons Attribution License (http://opendatacommons.org/licenses/by/1.0) for the annotations that we produced.

Website or github URL: https://github.com/Wimmics/cord19-nekg

Download: https://github.com/Wimmics/cord19-nekg/tree/master/dataset

SPARQL endpoint: https://covid19.i3s.unice.fr/sparql

Slides:

Project name: CovidChallenge? (provisional)

Contact name and email: Victor Mireles victor.mireles@semantic-web.com

Description: Converting to RDF several existing annotations of the CORD-19 dataset (Kaggle's papers) that are expressed in different vocabularies. Current vocabularies: Gene Ontology, ChEBI, Human Phenotype Ontology, MeSH disease.

Data format(s): RDF

Data license: Coming soon

Website or github URL: Coming soon

Slides:
