cord-19-search/feeding.md at main · vespa-cloud/cord-19-search

Prerequisites

Python 3 must be installed, along with two packages used to parse the dataset and produce Vespa JSON feed files:

pip3 install pandas numpy

Download the CORD-19 dataset

Download the dataset from ai2-semanticscholar-cord-19.

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz
tar xzvf cord-19_2022-06-02.tar.gz && cd 2022-06-02

Process the dataset

python3 /path/to/app/scripts/convert-to-json.py metadata.csv > feed.jsonl
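For orientation, here is a minimal sketch of what a conversion of this kind does: read rows from metadata.csv and emit one Vespa put operation per line. It is not the repository's convert-to-json.py. The namespace `covid-19`, the document type `doc`, and the choice of fields are assumptions made for illustration, and it uses only the standard library rather than pandas.

```python
import csv
import io
import json


def row_to_put(row, namespace="covid-19", doctype="doc"):
    """Turn one metadata.csv row (a dict) into a Vespa put operation.

    namespace and doctype are hypothetical; use the values that match
    the schema of the application you feed to.
    """
    doc_id = row["cord_uid"]
    return {
        "put": f"id:{namespace}:{doctype}::{doc_id}",
        "fields": {
            "cord_uid": doc_id,
            # Missing values become empty strings instead of null.
            "title": row.get("title") or "",
            "abstract": row.get("abstract") or "",
        },
    }


def convert(csv_file, out_file):
    """Write one JSON operation per line; return the number written."""
    count = 0
    for row in csv.DictReader(csv_file):
        if not row.get("cord_uid"):
            continue  # skip rows without an identifier
        out_file.write(json.dumps(row_to_put(row)) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    sample = io.StringIO("cord_uid,title,abstract\nabc123,A title,An abstract\n")
    out = io.StringIO()
    convert(sample, out)
    print(out.getvalue(), end="")
```

Each output line is a self-contained JSON object, which is what makes the file a JSONL feed that can be streamed rather than loaded whole.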

Merge the feed file with the CORD-19 SPECTER embeddings. This step expects a feed.jsonl file in the current directory (the file generated by the previous command).

tar xzvf cord_19_embeddings.tar.gz
cat cord_19_embeddings_2022-06-02.csv | python3 /path/to/app/scripts/merge.py > merged-feed.jsonl
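The merge can be pictured roughly as follows. This is a sketch under assumptions, not the repository's merge.py: it assumes each embedding line is a `cord_uid` followed by comma-separated floats, that the document id in feed.jsonl ends with that `cord_uid`, and that the embedding is stored in a field named `specter_embedding` (a hypothetical name).

```python
import json


def load_embeddings(csv_lines):
    """Map cord_uid -> list of floats, one embedding per input line."""
    embeddings = {}
    for line in csv_lines:
        line = line.strip()
        if not line:
            continue
        cord_uid, *values = line.split(",")
        embeddings[cord_uid] = [float(v) for v in values]
    return embeddings


def merge(feed_lines, embeddings, field="specter_embedding"):
    """Yield feed operations, adding an embedding where one exists.

    The field name is hypothetical; it must match a tensor field in
    the application's schema.
    """
    for line in feed_lines:
        if not line.strip():
            continue
        op = json.loads(line)
        # The user-specified part of the document id follows the last "::".
        doc_id = op["put"].rsplit("::", 1)[-1]
        vector = embeddings.get(doc_id)
        if vector is not None:
            # Indexed tensors can be fed as an object with a "values" array.
            op["fields"][field] = {"values": vector}
        yield op
```

Documents without a matching embedding pass through unchanged in this sketch; whether the real script keeps or drops them is not stated here.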

Feed the data

Use the vespa CLI to feed the data to your Vespa instance:

vespa feed -t <endpoint-url> merged-feed.jsonl

Indexing is CPU-intensive because both the abstract and the title are encoded using ColBERT.
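Since a feed run can take a while, it may be worth checking the file before starting. The helper below is a hypothetical pre-flight check, not part of the repository: it assumes every line of merged-feed.jsonl is a put operation with a `fields` object, and reports lines that do not parse or lack those keys.

```python
import json


def check_feed(lines):
    """Return (ok_count, problems) for an iterable of JSONL lines.

    problems is a list of (line_number, message) tuples.
    """
    ok, problems = 0, []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            op = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(op, dict) or "put" not in op or "fields" not in op:
            problems.append((number, "missing 'put' or 'fields'"))
            continue
        ok += 1
    return ok, problems


if __name__ == "__main__":
    with open("merged-feed.jsonl", encoding="utf-8") as feed:
        ok, problems = check_feed(feed)
    print(f"{ok} valid operations, {len(problems)} problems")
    for number, message in problems[:10]:
        print(f"line {number}: {message}")
```

Run it from the directory that contains merged-feed.jsonl; a non-empty problem list points at the lines to inspect before feeding.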
