#Vespa

Prerequisites

Python 3 must be installed, along with two packages used to parse the dataset and produce Vespa JSON feed files.

pip3 install pandas numpy 

Download the CORD-19 dataset

Download the 2022-06-02 historical release from the ai2-semanticscholar-cord-19 S3 bucket and extract it.

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz
tar xzvf cord-19_2022-06-02.tar.gz && cd 2022-06-02

Process the dataset

python3 /path/to/app/scripts/convert-to-json.py metadata.csv > feed.jsonl
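The conversion script itself is not shown here. As a rough illustration of what this step produces, the sketch below turns metadata rows into Vespa JSON feed lines, one `put` operation per line. It uses only the standard library for brevity. The column names (`cord_uid`, `title`, `abstract`), the namespace `covid-19`, and the document type `doc` are assumptions made for the example, not values taken from the application.

```python
import csv
import json


def to_feed_lines(csv_lines, namespace="covid-19", doc_type="doc"):
    """Yield one Vespa put operation (as a JSON string) per unique cord_uid.

    The namespace, doc_type and column names are hypothetical; adjust them
    to match the schema of the application being fed.
    """
    seen = set()
    for row in csv.DictReader(csv_lines):
        uid = row.get("cord_uid", "")
        # Skip rows without an id and repeated ids (first occurrence wins).
        if not uid or uid in seen:
            continue
        seen.add(uid)
        doc = {
            "put": f"id:{namespace}:{doc_type}::{uid}",
            "fields": {
                "cord_uid": uid,
                "title": row.get("title") or "",
                "abstract": row.get("abstract") or "",
            },
        }
        yield json.dumps(doc)
```

Passing an open file works as-is, for example `for line in to_feed_lines(open("metadata.csv", newline="", encoding="utf-8")): print(line)`.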

Merge the feed file with the CORD-19 SPECTER embeddings. This step expects the feed.jsonl file generated by the previous command to be in the current directory.

tar xzvf cord_19_embeddings.tar.gz
cat cord_19_embeddings_2022-06-02.csv | python3 /path/to/app/scripts/merge.py > merged-feed.jsonl
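To make the merge step more concrete, here is a minimal sketch of how embeddings could be joined onto feed documents by id. It is not the contents of merge.py. It assumes each embeddings row is a `cord_uid` followed by the vector components with no header row, that document ids end in `::<cord_uid>`, and that the target tensor field is called `specter_embedding` (a hypothetical name).

```python
import csv
import json


def merge_embeddings(feed_lines, embedding_rows, field="specter_embedding"):
    """Yield feed documents (JSON strings) with an embedding field attached.

    feed_lines: iterable of JSON strings, one Vespa put operation each.
    embedding_rows: iterable of CSV lines, assumed "<cord_uid>,<f1>,<f2>,...".
    The field name is an assumption for illustration only.
    """
    # Index the feed by cord_uid so the much larger embeddings file can be
    # streamed instead of being held in memory.
    docs = {}
    for line in feed_lines:
        line = line.strip()
        if line:
            doc = json.loads(line)
            docs[doc["put"].rsplit("::", 1)[-1]] = doc

    for row in csv.reader(embedding_rows):
        if not row:
            continue
        doc = docs.pop(row[0], None)
        if doc is None:
            continue  # embedding without a matching document
        doc["fields"][field] = {"values": [float(v) for v in row[1:]]}
        yield json.dumps(doc)

    # Documents that never received an embedding are emitted unchanged.
    for doc in docs.values():
        yield json.dumps(doc)
```

Called as `merge_embeddings(open("feed.jsonl"), sys.stdin)`, this mirrors the pipe above. Whether documents without an embedding should be kept or dropped is a design choice; the sketch keeps them.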

Feed the data

Use the vespa CLI to feed the data to your Vespa instance:

vespa feed -t <endpoint-url> merged-feed.jsonl

Indexing is CPU-intensive because both the abstract and the title are encoded using ColBERT.
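Because indexing is expensive, it can be worth sanity-checking the feed file before sending it. The helper below is a hypothetical pre-flight check, not part of the application: it only verifies that every non-empty line parses as a JSON object with `put` and `fields` keys, which assumes the feed contains put operations only.

```python
import json


def check_feed(lines):
    """Return (document_count, problems) for an iterable of feed lines.

    problems is a list of (line_number, message) tuples. Only the basic
    shape of a put operation is checked, not the document schema.
    """
    total = 0
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        total += 1
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(doc, dict) or "put" not in doc or "fields" not in doc:
            problems.append((number, "missing 'put' or 'fields'"))
    return total, problems
```

For example, `total, problems = check_feed(open("merged-feed.jsonl"))` gives a document count to compare against expectations and a list of lines to inspect before feeding.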