Extract ontology terms referenced in PubMed abstracts from the MEDLINE/PubMed Baseline Repository by running SciGraph against a set of ontologies.
Using OmniCorp requires the following open source tools:
- Git
- Maven
- Scala and sbt
- wget
On macOS, these can be installed using Homebrew by running the command: brew install git maven scala sbt wget.
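Before building, it can help to confirm that each tool is actually on the PATH. The following is a minimal sketch and not part of OmniCorp itself: the function name missing_tools is hypothetical, and it assumes Maven is exposed as the mvn executable, as in standard installs.

```shell
# Hypothetical helper (not part of OmniCorp): print each named
# executable that cannot be found on the PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumption: Maven's executable is named "mvn".
missing=$(missing_tools git mvn scala sbt wget)
if [ -n "$missing" ]; then
  # $missing is left unquoted so the names print on a single line.
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```

Any tool reported as missing can then be installed with Homebrew or the package manager for your platform.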
We need to use a specially modified version of SciGraph in order to carry out text annotations.
To install this version locally, run make SciGraph. This will download, compile and install the customized SciGraph we use.
You will then need to run make omnicorp-scigraph to generate the SciGraph instance for the ontologies specified in ontologies.ofn.
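The two build steps can be chained so that the second runs only if the first succeeds. This is a sketch under stated assumptions: the function name build_scigraph is hypothetical, and the directory passed to it (the current directory by default) is assumed to be the root of the OmniCorp checkout, where the Makefile lives.

```shell
# Hypothetical wrapper (not part of OmniCorp): build the customized
# SciGraph, then the SciGraph instance for the ontologies listed in
# ontologies.ofn.
# Assumes the given directory (default: current) holds the Makefile.
build_scigraph() {
  repo_dir=${1:-.}
  if [ ! -f "$repo_dir/Makefile" ]; then
    echo "No Makefile in $repo_dir; run from the repository root." >&2
    return 1
  fi
  # The second target is attempted only if the first one succeeds.
  make -C "$repo_dir" SciGraph && make -C "$repo_dir" omnicorp-scigraph
}
```

Call build_scigraph from inside the checkout, or pass the checkout path as the first argument.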
Extract ontology terms used in the COVID-19 Open Research Dataset (CORD) as tab-delimited files for further processing in COVID-KOP.
In order to generate OmniCORD output files, you should:
- Update the ROBOCORD_DATE variable in Makefile. You can look up the latest CORD-19 release date on the CORD-19 website.
- Download the CORD-19 dataset by running make robocord-download. This will automatically create a directory in the robocord-datas directory and download the CORD-19 dataset for $ROBOCORD_DATE into that directory.
- Uncompress the dataset by running make robocord-data.
- Test the extraction program by running make robocord-test. This extracts data from some articles to check that the program is working correctly, and creates a directory in the robocord-outputs directory to store the results. It's usually a good idea to clear the robocord-output directory after running the test and confirming that the output files look correct.
- Use robocord.job to attempt to run all the jobs on a SLURM cluster. Any number of jobs can be specified, but values of around 4000 seem to work well. Example: sbatch --array=0-3999 robocord.job.
- Use RoboCORDManager to re-run any jobs that failed to complete. You can use the --dry-run option to see what jobs will be executed before they are run. Jobs are executed using the robocord-sbatch.sh script, so modify that if necessary. Example: srun sbt "runMain org.renci.robocord.RoboCORDManager --job-size 20
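The download, uncompress, test and submit steps above can be previewed as a single script. This is only a sketch: the run and robocord_pipeline helpers and the DRY_RUN variable are hypothetical and not part of OmniCorp, and the script assumes it is run from the repository root with ROBOCORD_DATE already updated in the Makefile. By default it prints each command instead of executing it.

```shell
# Hypothetical helper: print a command when DRY_RUN is 1 (the default
# here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Steps in the order described above; each step runs only if the
# previous one succeeded.
robocord_pipeline() {
  run make robocord-download &&
  run make robocord-data &&
  run make robocord-test &&
  run sbatch --array=0-3999 robocord.job
}

robocord_pipeline
```

Setting DRY_RUN=0 executes the commands for real. Running them back to back skips the manual check of the test output, so it may be safer to stop after make robocord-test, inspect the results, and only then submit the array job. Re-running failed jobs with RoboCORDManager is left out of the sketch.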
Currently, we look for terms from the following ontologies:
- Uberon (base) (OWL)
- ChEBI (OWL)
- Cell Ontology (OWL)
- Environment Ontology (OWL)
- Gene Ontology (plus) (OWL)
- NCBITaxon (OWL)
- Relation Ontology (OWL)
- PRotein Ontology (PRO) (OWL)
- Biological Spatial Ontology (OWL)
- Mondo Disease Ontology (OWL)
- The Human Phenotype Ontology (OWL)
- Ontology for Biomedical Investigations (OWL)
- Sequence Ontology (OWL)
- HUGO Gene Nomenclature Committee (OWL)
- Experimental Factor Ontology (OWL)