This repository was archived by the owner on Oct 20, 2023. It is now read-only.

CORD 19 Semantic Annotation Projects

David Booth edited this page Oct 14, 2020 · 43 revisions

This page lists projects that are doing semantic annotation of the CORD-19 dataset. If you know of a project that is not listed here, please add it AND please contact David Booth, who chairs a semi-weekly teleconference (11am Boston time) to coordinate and learn about each other's efforts.

Teleconferences are announced on the public W3C Healthcare and Life Sciences mailing list.

2020-09-29 Houcemeddine Turki, University of Sfax, Tunisia: Wikidata and COVID-19, Creating a collaborative knowledge graph from CORD-19 scholarly publications

Contact name and email: Houcemeddine A. Turki turkiabdelwaheb@hotmail.fr

Description: Knowledge graphs are an essential ingredient for information systems to handle the ever growing COVID-19 data on a daily basis. This presentation explains how open and collaborative FAIR knowledge bases like Wikidata can be useful to create a large-scale semantic representation of COVID-19 information from CORD-19 scholarly publications. I give an overview of how a data model has been collaboratively developed and maintained for COVID-19 knowledge, and I provide a detailed snapshot about the various methods used to extract items and statements from CORD-19 research papers. Then, I outline the tools for the enrichment of COVID-19 information on Wikidata as well as the knowledge graph validation methods applicable to COVID-19 knowledge. Finally, I describe the COVID-19 information in Wikidata and discuss its usefulness in supporting human decisions and social recommendations about the infectious disease.

Data format(s): RDF

Data license: CC0

Website or github URL: https://www.wikidata.org/wiki/Wikidata:WikiProject_COVID-19

Draft paper: https://zenodo.org/record/4033382 and https://zenodo.org/record/4008359

Slides: https://commons.wikimedia.org/wiki/File:W3_CORD-19_-_Wikidata_and_COVID-19.pdf

Video presentation (recorded 2020-09-29): https://youtu.be/TwudGFtT4A4

Chat comments made during presentation:

  • It is possible to link to the individual phrase from which a Wikidata statement originated. Demo at the reference for the “schizophrenia” claim in https://www.wikidata.org/w/index.php?title=Q13561329&oldid=1278213857#P1910 . It has a “reference URL” that points to https://via.hypothes.is/https://pubmed.ncbi.nlm.nih.gov/11126396/#annotations:HVXGMnfUEeetV2sj-_VpSQ .
  • We are using existing vocabularies to the extent possible. SNOMED-CT and many others, however, are not openly licensed, so we cannot incorporate them wholesale. What we can do, though, is mapping.
  • The ORKG structured annotations are a demo for COVID; they do not yet work at scale.
  • Here is an example of such an argumentation-focused knowledge graph: https://hi-knowledge.org/ It focuses on invasion biology for now.

2020-07-28 Marcin Joachimiak, Lawrence Berkeley Natl Lab: KG-COVID-19, A knowledge graph for COVID-19 response

Contact name and email: Marcin Joachimiak marcinjoachimiak@gmail.com

Description:

Data format(s):

Data license:

Website or github URL:

Paper: https://www.biorxiv.org/content/10.1101/2020.08.17.254839v1

Slides: https://docs.google.com/presentation/d/1_QVXyHeJMiFUHInaKlS8db9Yz0fpZUhK5aSG18JLFtw/edit#slide=id.g8cdd816443_0_877

Video presentation (recorded 2020-07-28): https://youtu.be/iGqRvKhuuSs

2020-07-21 Michael Liebman, IPQ Analytics: Modeling COVID-19, from the clinic back

Contact name and email: Michael Liebman michael.liebman@ipqanalytics.com

Description:

Data format(s):

Data license:

Website or github URL:

Draft paper:

Slides:

Video presentation (recorded 2020-07-21): https://youtu.be/ueJunueo3hg

2020-06-23 Jin-Dong Kim Covid19-PubAnnotation

Contact name and email: Jin-Dong Kim jindong.kim@gmail.com

Description: PubAnnotation is a repository of text annotations, especially those made to life sciences literature, e.g., PubMed or PMC articles. Anyone who has such annotations can register them in PubAnnotation. When annotations are registered, PubAnnotation aligns them to the canonical text taken from PubMed and PMC, which means all the annotations in PubAnnotation are linked to each other through canonical texts. It is a new way of publishing or sharing text annotations using recent web technology: annotations become accessible and searchable through standard web protocols, e.g., a REST API.

Data format(s):

Data license:

Website or github URL: https://pubannotation.org/

Slides: https://docs.google.com/presentation/d/1ynoe1Xxc_-rTiebbvvuPBQMaktK-DX87McuDVaLbI1g/edit#slide=id.g726dbf02a0_0_0

Video presentation (recorded 2020-06-23): https://youtu.be/LKz9kRtLi9I
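To illustrate the idea of aligning annotations to a canonical text, here is a minimal Python sketch of stand-off annotation: each annotation stores character offsets into one shared text, so annotation sets from different sources can be combined. The sample text, the labels, and the helper names `covered_text` and `merge` are assumptions for illustration; the field names (`text`, `denotations`, `span`, `obj`) follow a PubAnnotation-style JSON layout, but this is not the service's actual code or API.

```python
# Hypothetical canonical text shared by two independent annotation sets.
CANONICAL_TEXT = "SARS-CoV-2 binds the ACE2 receptor."

# Two made-up annotation sets; offsets are character positions in the text.
set_a = [{"id": "T1", "span": {"begin": 0, "end": 10}, "obj": "Virus"}]
set_b = [{"id": "T2", "span": {"begin": 21, "end": 25}, "obj": "Protein"}]


def covered_text(text, denotation):
    """Return the substring of the canonical text that a denotation points to."""
    span = denotation["span"]
    return text[span["begin"]:span["end"]]


def merge(text, *annotation_sets):
    """Combine annotation sets that share one canonical text, ordered by offset."""
    merged = [d for annotations in annotation_sets for d in annotations]
    merged.sort(key=lambda d: d["span"]["begin"])
    return [(d["id"], covered_text(text, d), d["obj"]) for d in merged]


combined = merge(CANONICAL_TEXT, set_b, set_a)
for row in combined:
    print(row)
```

Because both sets refer to the same character coordinates, merging needs no text matching at all, which is the practical benefit of aligning everything to one canonical copy.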

2020-06-16 Victor Mireles, Semantic Web Company: COVID-19 Knowledge Graph.

Contact name and email: Victor Mireles-Chavez victor.mireles-chavez@semantic-web.com

Description:

Data format(s):

Data license:

Website or github URL:

Draft paper:

Slides: https://docs.google.com/presentation/d/1xaS_88sJ47iSrvv0ezOfjscIvG2VINUe7vqrUEMiaCA/edit?usp=sharing

Video presentation (recorded 2020-06-16): https://youtu.be/HnoTDndSK_A

2020-06-16 Feichen Shen and David Oniani, Mayo Clinic: Constructing Co-occurrence Network Embeddings

Contact name and email: Feichen Shen, Ph.D. Shen.Feichen@mayo.edu and David Oniani Oniani.David@mayo.edu

Description: Constructing Co-occurrence Network Embeddings to Assist Association Extraction for COVID-19 and Other Coronavirus Infectious Diseases

Data format(s):

Data license:

Website or github URL:

Slides: https://www.davidoniani.com/research/co-occurence-network-embeddings-presentation.pdf

Video presentation (recorded 16-Jun-2020): https://youtu.be/RxEsBP40OxE

2020-06-02 Scott Malec, University of Pittsburgh: CORD-SEMANTICTRIPLES / Machine Reading for COVID-19 and Alzheimer's

Contact name and email: Scott Malec (scott.malec@gmail.com | sam413@pitt.edu)

Description: Computable knowledge extracted from the literature using machine reading can help researchers best understand and leverage the unprecedented volume of information gathered about the novel coronavirus. We hypothesize that machine interpretation techniques can be used to build graphical models of related concepts, with highly connected nodes suggesting potentially plausible biological actors. We introduce a new resource, derived from the Semantic MEDLINE database (SemMedDB), reflecting documents also in the COVID-19 corpus. SemMedDB contains concept-relation-concept semantic triples, or predications. After extracting ~106K semantic predications, we imported these into a network and applied network centrality metrics (degree, closeness, betweenness) to identify and substantiate association factors related to COVID-19 for biological plausibility. Filtering the nodes by semantic type to search for drugs, drug targets, biomarkers, or comorbidities associated with complications, we were able to recapitulate agents already in randomized controlled trials for preventing or treating COVID-19 infections, as well as comorbidities associated with lethal complications, many of which made sense upon further inspection. This guilt-by-association analysis demonstrates the value of the information revealed as computable knowledge by machine reading software.

Data format(s): RDF/XML, SQL, including Cytoscape-compatible formats (*.tsv, *.SIF)

Data license: TBD

Website or github URL: https://github.com/kingfish777/COVID19 (still a mess). See the .cys file for a Cytoscape-friendly version and the .xls spreadsheet for preliminary results. I will be uploading other formats, including the processing pipeline, and pointing to a more ambitious follow-up project that applies computable knowledge, derived from the COVID-19 corpus using various machine reading frameworks, to support several practical use cases.

Draft paper: https://docs.google.com/document/d/1qQkLlvwOWOy1Rt7eUTKTCUfh-WodyXd8uv8UTnpCGoA/edit#

Slides: https://docs.google.com/presentation/d/13upacoOuKXhguToT-z2MNPE_iDJWY0vvbFz8NDAWpaQ/edit?usp=sharing

Video presentation (recorded June 2, 2020): https://www.youtube.com/watch?v=ydnx_Rg1PYs
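The centrality analysis described in the CORD-SEMANTICTRIPLES entry above can be sketched roughly as follows. This is a standard-library Python illustration under stated assumptions, not the project's pipeline: the triples are invented placeholders rather than SemMedDB predications, the graph is treated as undirected, predicates are ignored, and only degree and closeness are computed (betweenness is left out for brevity).

```python
from collections import defaultdict, deque

# Hypothetical subject-predicate-object triples standing in for predications.
triples = [
    ("drug_a", "TREATS", "disease_x"),
    ("drug_b", "TREATS", "disease_x"),
    ("disease_x", "COEXISTS_WITH", "condition_y"),
    ("gene_g", "ASSOCIATED_WITH", "disease_x"),
    ("gene_g", "INTERACTS_WITH", "drug_a"),
]


def build_graph(triples):
    """Build undirected adjacency sets; the predicate is ignored in this sketch."""
    graph = defaultdict(set)
    for subj, _pred, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable node count divided by the sum of BFS shortest-path lengths."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


graph = build_graph(triples)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
top = max(degree, key=degree.get)  # most highly connected concept
print(top, degree[top], closeness[top])
```

In this toy graph the hub node is the one every other concept links to; on real predications, filtering the ranked nodes by semantic type would be the next step the description mentions.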

2020-06-02 Pedro Szekely, USC Information Sciences Institute: A Knowledge Graph Integrating Annotations On 20,000 COVID-19 Scientific Articles

Contact name and email: Pedro Szekely szekely@usc.edu, USC Information Sciences Institute

Description:

Data format(s):

Data license:

Website or github URL: https://github.com/usc-isi-i2/kgtk

Paper: https://arxiv.org/abs/2006.00088

Jupyter Notebook with example on how to create the COVID-19 KG using KGTK: https://github.com/usc-isi-i2/CKG-COVID-19/blob/dev/build-covid-kg.ipynb

Slides: https://docs.google.com/presentation/d/1_uFKP6xmcV0rYjqVxorEI97weN4uauPxG11YHTo6tD8/edit?usp=sharing

Video presentation (recorded June 2, 2020): https://www.youtube.com/watch?v=ydnx_Rg1PYs&t=2346s

2020-05-26 Oliver Giles, SciBite: TERMite CORD19

Contact name and email: Oliver Giles oliver.giles@scibite.com and James Malone james@scibite.com, SciBite

Description:

Data format(s):

Data license:

Website or github URL: https://www.scibite.com/

Slides: https://lists.w3.org/Archives/Public/www-archive/2020May/att-0003/covid19.pdf

Video presentation (recorded May 26, 2020): https://www.youtube.com/watch?v=3IdkRU9Durc

2020-05-19 Gaurav Vaidya: OmniCORD

Contact name and email: Gaurav Vaidya, http://www.ggvaidya.com/

Description:

Data format(s):

Data license:

Website or github URL: https://github.com/NCATS-Gamma/omnicorp

Slides: https://docs.google.com/presentation/d/1ghAqVwgrCO6moGyWNSZfRBApMZfJnoqa9Z5NwhRF53g/edit#slide=id.gc6f980f91_0_0

Video presentation (recorded May 19, 2020): https://www.youtube.com/watch?v=YcoG9H6r7R0&t=9s

2020-05-19 Gollam Rabby, VSE University, Prague: Entity-Based-Document-Classification-on-the-CORD---19-Corpus

Contact name and email: Tomáš Kliegr tomas.kliegr@vse.cz, Gollam Rabby rabby2186@gmail.com

Description: Tools for extracting associations from knowledge graphs and transaction data.

Presentation: https://docs.google.com/presentation/d/1eX9eTb0C8roy7pYK8li3V5YcWhFByN6is6hz_b62AcA/edit#slide=id.p

Data format(s): RDF, SQL Dumps, CSV

Data license: NA

Website or github URL:

  • JupyterLab notebook with existing code for entity extraction.
  • Demo of our web-based self-service EasyMiner tool for learning rules from single CSVs.
  • Demo of our web-based self-service RDFRules tool for learning rules from RDF KGs.

Slides: https://github.com/corei5/Entity-Based-Document-Classification-on-the-CORD---19-Corpus

Video presentation (recorded May 19, 2020): https://www.youtube.com/watch?v=YcoG9H6r7R0&t=525s

2020-05-19 Marcin Joachimiak, Lawrence Berkeley National Laboratory: KG-COVID-19, a knowledge graph for COVID-19 response

Contact name and email: Marcin Joachimiak marcinjoachimiak@gmail.com, Lawrence Berkeley National Laboratory, Monarch Initiative, and IDG

Description: Lightweight construction and maintenance of knowledge graphs for COVID-19 drug repurposing efforts.

Data format(s): RDF/TTL http://kg-hub.berkeleybop.io/kg-covid-19.nt.gz

Data license:

Website or github URL: https://covidscholar.org/

Slides: https://lists.w3.org/Archives/Public/www-archive/2020May/att-0002/01-part

Video presentation (recorded May 19, 2020): https://www.youtube.com/watch?v=YcoG9H6r7R0&t=996s

2020-05-19 Michael Liebman, IPQ Analytics LLC: Modeling COVID-19 From the Clinic Back

Contact name and email: Michael Liebman michael.liebman@ipqanalytics.com IPQ Analytics LLC

Description:

Data format(s):

Data license:

Website or github URL:

Slides: Not available

Video presentation (recorded May 19, 2020): https://www.youtube.com/watch?v=YcoG9H6r7R0&t=1503s

2020-05-19 David Booth, Mayo Clinic (consultant): CORD-19-on-FHIR

Contact name and email: David Booth david@dbooth.org, Jiang, Guoqian, M.D., Ph.D. Jiang.Guoqian@mayo.edu, Harold Solbrig solbrig@jhu.edu

Description: We are currently doing NLP to extract Conditions, Medications and Procedures from title and abstract. We plan to expand this to also look at the article full text where available. We are also using Pubtator to extract Species, Gene, Disease, Chemical, CellLine, Mutation and Strain. The result is represented in FHIR RDF.

Data format(s): FHIR RDF

Data license: Our annotations are CC0 licensed, though the CORD-19 dataset has its own licensing.

Website or github URL: https://github.com/fhircat/CORD-19-on-FHIR

Slides: https://tinyurl.com/cord-19-on-fhir

Video presentation (recorded May 19, 2020): https://www.youtube.com/watch?v=YcoG9H6r7R0&t=2218s

2020-05-12 Franck Michel, Université Côte d’Azur, CNRS, Inria: CORD-19 Named Entities Knowledge Graph (CORD19-NEKG)

Contact name and email: Fabien Gandon fabien.gandon@inria.fr, Franck Michel fmichel@i3s.unice.fr

Description: CORD-19 Named Entities Knowledge Graph (CORD19-NEKG) is an RDF dataset describing named entities identified in the scholarly articles of the COVID-19 Open Research Dataset (CORD-19). CORD19-NEKG is an initiative of the Wimmics team. RDF files are generated using Morph-xR2RML, an implementation of the xR2RML mapping language.

Data format(s): RDF Turtle

Data license: CORD19 license for the part of the dataset which is just an RDF version of CORD19 metadata, Open Data Commons Attribution License (https://opendatacommons.org/licenses/by/index.html) for the annotations that we produced.

Website or github URL: https://github.com/Wimmics/cord19-nekg Download: https://github.com/Wimmics/cord19-nekg/tree/master/dataset SPARQL endpoint: https://covid19.i3s.unice.fr/sparql

Paper: https://hal.archives-ouvertes.fr/hal-02939363/document

Slides: https://www.dropbox.com/s/nnyg1o45f9dvimk/20200512%20Covid-on-the-Web%20-%20CORD-19%20semantic%20annotations.pdf?dl=0

Video presentation (recorded May 12, 2020): https://www.youtube.com/watch?v=oUk9PXGM2fY
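For readers who want to try the SPARQL endpoint listed in the CORD19-NEKG entry above, the following Python sketch builds a standard SPARQL Protocol GET request using only the standard library. The query is deliberately schema-agnostic because this page does not describe the graph's vocabulary; the helper names `build_request` and `run_query` are illustrative, and whether the endpoint is still reachable is an open assumption.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://covid19.i3s.unice.fr/sparql"  # taken from the entry above

# Generic query that makes no assumption about the dataset's vocabulary.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_query(endpoint, query, timeout=30):
    """Send the request and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Calling `run_query(ENDPOINT, QUERY)` would perform the actual HTTP request; it is not invoked here so the sketch runs without network access.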

Project name: COVID-KG

Contact name and email: Gilles Vandewiele gilles.vandewiele@ugent.be, Bram Steenwinckel bram.steenwickel@ugent.be

Description: Transform the JSON and CSV files into RDF to create a knowledge graph that contains at least the same information as the original dataset, plus additional knowledge to facilitate analysis by other researchers.

Data format(s): RDF

Data license: TBD, as open as possible.

Website or github URL: http://github.com/GillesVandewiele/COVID-KG/ & https://www.kaggle.com/group16/covid19-literature-knowledge-graph

Slides:

Video presentation (recorded @@@@):
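As a rough illustration of the CSV-to-RDF step that the COVID-KG entry describes, the sketch below turns a few metadata rows into N-Triples lines with the Python standard library. It is not the project's code: the sample rows are invented, the `http://example.org/cord19/` namespace is a placeholder, and the choice of Dublin Core `title` and BIBO `doi` as predicates is an assumption made for the example.

```python
import csv
import io

# Hypothetical two-row excerpt shaped like a CORD-19 metadata CSV.
SAMPLE_CSV = '''cord_uid,title,doi
abc123,An example title,10.1000/example.1
def456,"A title with ""quotes""",
'''

BASE = "http://example.org/cord19/"  # placeholder namespace (assumption)
DCT_TITLE = "http://purl.org/dc/terms/title"
BIBO_DOI = "http://purl.org/ontology/bibo/doi"


def literal(value):
    """Escape a string as a plain N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def rows_to_ntriples(csv_text):
    """Yield one N-Triples line per non-empty metadata field."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['cord_uid']}>"
        if row["title"]:
            yield f"{subject} <{DCT_TITLE}> {literal(row['title'])} ."
        if row["doi"]:
            yield f"{subject} <{BIBO_DOI}> {literal(row['doi'])} ."


lines = list(rows_to_ntriples(SAMPLE_CSV))
print("\n".join(lines))
```

Empty fields produce no triple, so the output never asserts a blank value; the "extra knowledge" mentioned in the description would be added as further triples on the same subjects.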

Project name: CORD-ReDrugS

Contact name and email: Jim McCusker mccusj2@rpi.edu

Description: Enhance ReDrugS [1] to use extracted entities and relations from [2] to repurpose potential therapies.

[1] McCusker JP, Dumontier M, Yan R, He S, Dordick JS, McGuinness DL. 2017. Finding melanoma drugs through a probabilistic knowledge graph. PeerJ Computer Science 3:e106 doi:10.7717/peerj-cs.106

[2] kaggle.com/yitongtseo/cord19-named-entities

Data format(s): RDF, SIO + PROV

Data license: TBD

Website or github URL:

Slides:

Video presentation (recorded ): None

@@@@

Contact name and email:

Description:

Data format(s):

Data license:

Website or github URL:

Draft paper:

Slides:

Video presentation (recorded ):
