Skip to content

monarch-initiative/kghub-downloader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

181 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KG-Hub Downloader

| Documentation | Repository | PyPI |

Overview

This is a configuration based file caching downloader with initial support for http requests & queries against elasticsearch.

Installation

KGHub Downloader is available to install via pip:

pip install kghub-downloader

Configure

The downloader requires a YAML file which contains a list of target URLs to download, and local names to save those downloads.
For an example, see example/download.yaml

Available options are:

  • *url: The URL to download from. Currently supported:

    • http(s)
    • ftp
      • with glob: option to download files with specific extensions (only with ftp as of now and looks recursively).
    • Google Cloud Storage (gs://)
    • Google Drive (gdrive:// or https://drive.google.com/...). The file must be publicly accessible.
    • Amazon AWS S3 bucket (s3://)
    • GitHub Release Assets (git://RepositoryOwner/RepositoryName)

    If the URL includes a name in {CURLY_BRACES}, it will be expanded from environment variables.

  • glob: An optional glob pattern to limit downloading files (FTP only)

  • local_name: The name to save the file as locally

  • tag: A tag to use to filter downloads

  • api: The API to use to download the file. Currently supported: elasticsearch

  • elastic search options

    • query_file: The file containing the query to run against the index
    • index: The elastic search index for query

* Note:
Google Cloud Storage URLs require that you have set up your credentials as described here. You must:

Mirorring local files to Amazon AWS S3 bucket requires the following:

You can also include any secrets like API keys you have set as environment variables using {VARIABLE_NAME}, for example:

---
- url: "https://example.com/myfancyfile.json?key={YOUR_SECRET}"
  localname: myfancyfile.json

Note: YOUR_SECRET MUST as an environment variable, and be sure to include the {curly braces} in the url string.

Usage

Downloader can be used directly in Python or via command line

In Python

from kghub_downloader.download_utils import download_from_yaml

download_from_yaml(yaml_file="download.yaml", output_dir="data")

Command Line

To download files listed in a download.yaml file:

$ downloader [OPTIONS] [YAML_FILE]

Arguments:

  • [YAML_FILE]: List of files to download in YAML format [default: download.yaml]

Options:

  • --output-dir TEXT: Path to output directory [default: .]
  • --ignore-cache / --no-ignore-cache: Ignoring already downloaded files and download again [default: no-ignore-cache]
  • --progress / --no-progress: Show progress for individual downloads [default: progress]
  • --fail-on-error / --no-fail-on-error: Do not attempt to download more files if one raises an error [default: no-fail-on-error]
  • --snippet-only / --no-snippet-only: Only download a snippet of the file. [HTTP(S) resources only. [default: no-snippet-only]
  • --verbose / --no-verbose: Show verbose output [default: no-verbose]
  • --tags TEXT: Optional list of tags to limit downloading to
  • --mirror TEXT: Optional remote storage URL to mirror download to. Supported buckets: Google Cloud Storage

Examples:

$ downloader --output_dir example_output --tags zfin_gene_to_phenotype example.yaml
$ downloader --output_dir example_output --mirror gs://your-bucket/desired/directory

# Note that if your YAML file is named `download.yaml`,
# the argument can be omitted from the CLI call.
$ downloader --output_dir example_output

Development

Install

git clone https://github.com/monarch-initiative/kghub-downloader.git
cd kghub-downloader
poetry install

Run tests

poetry run pytest

NOTE: The tests require gcloud credentials to be set up as described above, using the Monarch github actions service account.

About

Configuration based file caching downloader

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors