CORD 19 Semantic Annotation Projects
This page lists all known projects that are doing semantic annotation of the CORD-19 dataset. If you know of a project that is not listed here, please add it AND please contact David Booth, who chairs a weekly teleconference (11am Boston time) to coordinate and learn about each other's efforts.
Discussion of this work is on the public W3C Healthcare and Life Sciences mailing list, and teleconference minutes are posted there.
Contact name and email: David Booth david@dbooth.org, Jiang, Guoqian, M.D., Ph.D. Jiang.Guoqian@mayo.edu, Harold Solbrig solbrig@jhu.edu
Description: We are currently using NLP to extract Conditions, Medications and Procedures from each article's title and abstract. We plan to expand this to cover the article full text where available. We are also using PubTator to extract Species, Gene, Disease, Chemical, CellLine, Mutation and Strain. The result is represented in FHIR RDF.
Data format(s): FHIR RDF
Data license: Our annotations are CC0 licensed.
Website or github URL: https://github.com/fhircat/CORD-19-on-FHIR
Slides: https://lists.w3.org/Archives/Public/www-archive/2020Apr/att-0002/CORD-19-on-FHIR-2020-0-31.pdf
Contact name and email: Jim McCusker mccusj2@rpi.edu
Description: Enhance ReDrugS [1] to use extracted entities and relations from [2] to repurpose potential therapies.
[1] McCusker JP, Dumontier M, Yan R, He S, Dordick JS, McGuinness DL. 2017. Finding melanoma drugs through a probabilistic knowledge graph. PeerJ Computer Science 3:e106. doi:10.7717/peerj-cs.106
[2] kaggle.com/yitongtseo/cord19-named-entities
Data format(s): RDF, SIO + PROV
Data license: TBD
Website or github URL:
Slides:
Contact name and email: Scott Malec (scott.malec@gmail.com | sam413@pitt.edu)
Description: Computable knowledge extracted from the literature using machine reading can help researchers understand and leverage the unprecedented volume of information gathered about the novel coronavirus. We hypothesize that machine interpretation techniques can be used to build graphical models of related concepts, with highly connected nodes suggesting plausible biological actors. We introduce a new resource, derived from the Semantic MEDLINE database (SemMedDB), covering documents that also appear in the COVID-19 corpus. SemMedDB contains concept-relation-concept semantic triples, or predications. After extracting ~106K semantic predications, we imported these into a network and applied network centrality metrics (degree, closeness, betweenness) to identify factors associated with COVID-19 and assess their biological plausibility. Filtering the nodes by semantic type to search for drugs, drug targets, biomarkers, or comorbidities associated with complications, we were able to recapitulate agents already in randomized controlled trials for preventing or treating COVID-19 infections, as well as comorbidities associated with lethal complications, many of which made sense upon further inspection. This guilt-by-association analysis demonstrates the value of the information that machine reading software exposes as computable knowledge.
Data format(s): RDF/XML, SQL, including Cytoscape-compatible formats (*.tsv, *.SIF)
Data license: TBD
Website or github URL: https://github.com/kingfish777/COVID19 (still a mess). See the .cys file for a Cytoscape-friendly version and the .xls spreadsheet for preliminary results. I will be uploading other formats, including the processing pipeline, and pointing to a more ambitious follow-up project that applies computable knowledge, derived from the COVID-19 corpus using various machine reading frameworks, to support several practical use cases.
Draft paper: https://docs.google.com/document/d/1qQkLlvwOWOy1Rt7eUTKTCUfh-WodyXd8uv8UTnpCGoA/edit#
Slides:
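The centrality step described in the entry above can be illustrated with a minimal sketch. It assumes predications are available as (subject, predicate, object) tuples; the sample triples and node names below are hypothetical and are not taken from SemMedDB or from the project's pipeline. Only degree and closeness are shown; betweenness is omitted for brevity.

```python
from collections import defaultdict, deque

# Hypothetical predications (subject, predicate, object); not real SemMedDB data.
predications = [
    ("drug_A", "TREATS", "disease_X"),
    ("drug_B", "TREATS", "disease_X"),
    ("gene_G", "ASSOCIATED_WITH", "disease_X"),
    ("drug_A", "INHIBITS", "gene_G"),
    ("disease_X", "COEXISTS_WITH", "disease_Y"),
]

def build_graph(triples):
    """Build an undirected adjacency map, ignoring predicate labels."""
    graph = defaultdict(set)
    for subj, _pred, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def closeness_centrality(graph):
    """(reachable nodes) / (sum of shortest-path distances), computed by BFS."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores

graph = build_graph(predications)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)

# Rank concepts so that the most connected ones come first.
ranked = sorted(graph, key=lambda node: degree[node], reverse=True)
```

In this toy graph the most connected concept is the one every other concept links to; on real predications one would additionally filter nodes by semantic type before ranking, as the description notes.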
Contact name and email: Tomáš Kliegr tomas.kliegr@vse.cz, Gollam Rabby rabby2186@gmail.com
Description: Tools for extracting associations from knowledge graphs and transaction data
Presentation: https://docs.google.com/presentation/d/1eX9eTb0C8roy7pYK8li3V5YcWhFByN6is6hz_b62AcA/edit#slide=id.p
Data format(s): RDF, SQL Dumps, CSV
Data license: NA
Website or github URL:
- JupyterLab notebook with existing code for entity extraction.
- Demo of our web-based self-service EasyMiner tool for learning rules from single CSVs.
- Demo of our web-based self-service RDFRules tool for learning rules from RDF KGs.
Slides:
Contact name and email: Gilles Vandewiele gilles.vandewiele@ugent.be, Bram Steenwinckel bram.steenwickel@ugent.be
Description: Transform the JSON and CSV files into RDF to create a Knowledge Graph that contains at least the same information as the original dataset, enriched with extra knowledge to make analysis easier for other researchers.
Data format(s): RDF
Data license: TBD, as open as possible.
Website or github URL: http://github.com/GillesVandewiele/COVID-KG/ & https://www.kaggle.com/group16/covid19-literature-knowledge-graph
Slides:
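As a rough illustration of the CSV-to-RDF idea in the entry above, the sketch below turns rows of a metadata CSV into N-Triples lines. The base URI, the column names and the sample row are assumptions made for this example; they do not reflect the project's actual mapping or vocabulary choices.

```python
import csv
import io

# Hypothetical base URI and sample data; column names are illustrative only.
BASE = "http://example.org/paper/"
DCT_TITLE = "http://purl.org/dc/terms/title"

SAMPLE_CSV = 'paper_id,title\nabc123,"A study of ""novel"" coronaviruses"\n'

def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def rows_to_ntriples(csv_text):
    """Yield one N-Triples line (paper, dcterms:title, literal) per CSV row."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['paper_id']}>"
        literal = f'"{escape_literal(row["title"])}"'
        yield f"{subject} <{DCT_TITLE}> {literal} ."

triples = list(rows_to_ntriples(SAMPLE_CSV))
```

A real conversion would map many more fields and would likely use an RDF library or a mapping language rather than string formatting; this only shows the shape of the transformation.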
Contact name and email: Jin-Dong Kim jindong.kim@gmail.com
Description:
Data format(s):
Data license:
Website or github URL:
Slides:
Contact name and email: James Malone james@scibite.com
Description:
Data format(s):
Data license:
Website or github URL:
Slides:
Contact name and email: Fabien Gandon fabien.gandon@inria.fr, Franck Michel fmichel@i3s.unice.fr
Description: CORD-19 Named Entities Knowledge Graph (CORD19-NEKG) is an RDF dataset describing named entities identified in the scholarly articles of the COVID-19 Open Research Dataset (CORD-19). CORD19-NEKG is an initiative of the Wimmics team. RDF files are generated using Morph-xR2RML, an implementation of the xR2RML mapping language.
Data format(s): RDF Turtle
Data license: CORD19 license for the part of the dataset which is just an RDF version of CORD19 metadata, Open Data Commons Attribution License (http://opendatacommons.org/licenses/by/1.0) for the annotations that we produced.
Website or github URL: https://github.com/Wimmics/cord19-nekg
Download: https://github.com/Wimmics/cord19-nekg/tree/master/dataset
SPARQL endpoint: https://covid19.i3s.unice.fr/sparql
Slides:
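For readers who want to try the SPARQL endpoint listed in the entry above, here is a minimal sketch of building a query URL according to the SPARQL 1.1 Protocol (HTTP GET with a `query` parameter). The query is deliberately generic and assumes nothing about the vocabulary CORD19-NEKG uses; the helper name `build_query_url` is made up for this example, and the code does not contact the server.

```python
from urllib.parse import urlencode

ENDPOINT = "https://covid19.i3s.unice.fr/sparql"

# Generic query that makes no assumption about the dataset's vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

def build_query_url(endpoint, query):
    """Return a SPARQL 1.1 Protocol GET URL carrying the URL-encoded query."""
    return f"{endpoint}?{urlencode({'query': query})}"

url = build_query_url(ENDPOINT, QUERY)
```

The resulting URL can be fetched with any HTTP client; sending an `Accept: application/sparql-results+json` header is the usual way to ask for JSON results, although whether the endpoint is still online is not something this page can confirm.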
Contact name and email: Victor Mireles victor.mireles@semantic-web.com
Description: Converting to RDF several existing annotations of the CORD-19 dataset (Kaggle's paper) that are currently expressed in different vocabularies. Current vocabularies: Gene Ontology, ChEBI, Human Phenotype Ontology, MeSH disease.
Data format(s): RDF
Data license: Coming soon
Website or github URL: Coming soon
Slides: