Python 3 must be installed, along with two packages (pandas and numpy) used to parse the dataset and produce Vespa JSON feed files.
pip3 install pandas numpy
Download and extract the dataset from the ai2-semanticscholar-cord-19 bucket:
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz
tar xzvf cord-19_2022-06-02.tar.gz && cd 2022-06-02
Convert metadata.csv to a Vespa JSON feed file:
python3 /path/to/app/scripts/convert-to-json.py metadata.csv > feed.jsonl
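As a rough illustration of what the conversion step produces, the sketch below turns metadata rows into Vespa put operations, one JSON object per line. It is not the actual convert-to-json.py: the namespace `covid-19`, the document type `doc`, the use of `cord_uid` as the document id, and the three columns selected are all assumptions, and it uses the standard-library csv module rather than pandas to stay dependency-free.

```python
import csv
import io
import json


def rows_to_feed(csv_file, namespace="covid-19", doctype="doc"):
    """Yield one Vespa put operation (a dict) per CSV row.

    namespace, doctype and the selected columns are hypothetical;
    adjust them to match the schema of your application.
    """
    for row in csv.DictReader(csv_file):
        cord_uid = row.get("cord_uid") or ""
        yield {
            "put": f"id:{namespace}:{doctype}::{cord_uid}",
            "fields": {
                "cord_uid": cord_uid,
                "title": row.get("title") or "",
                "abstract": row.get("abstract") or "",
            },
        }


# Small in-memory sample standing in for metadata.csv.
sample = io.StringIO("cord_uid,title,abstract\nabc123,A title,An abstract\n")
for op in rows_to_feed(sample):
    print(json.dumps(op))
```

To run it against the real file, pass an open file handle such as `open("metadata.csv", newline="", encoding="utf-8")` in place of the sample.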
Merge the feed file with the CORD-19 SPECTER embeddings. This step expects a feed.jsonl file in
the current directory (the file generated by the previous step).
tar xzvf cord_19_embeddings.tar.gz
cat cord_19_embeddings_2022-06-02.csv | python3 /path/to/app/scripts/merge.py > merged-feed.jsonl
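The merge step can be pictured as a join on `cord_uid` between the feed operations and the embedding rows. The following is a minimal sketch of that idea, not the actual merge.py. It assumes each embeddings line holds a `cord_uid` followed by the vector components, that each feed document carries a `cord_uid` field, and that the target tensor field is called `specter_embedding` (a hypothetical name).

```python
import csv
import io
import json


def merge_embeddings(feed_lines, embedding_lines, field="specter_embedding"):
    """Yield feed operations extended with their embedding.

    Feed documents are held in memory keyed by cord_uid, while the
    embedding rows are streamed. Documents without a matching embedding
    row are not emitted in this sketch.
    """
    docs = {}
    for line in feed_lines:
        if line.strip():
            op = json.loads(line)
            docs[op["fields"]["cord_uid"]] = op

    # Assumed row layout: cord_uid, v1, v2, ..., vN
    for row in csv.reader(embedding_lines):
        if not row or row[0] not in docs:
            continue
        op = docs[row[0]]
        # Indexed tensors can be written with the "values" short form.
        op["fields"][field] = {"values": [float(x) for x in row[1:]]}
        yield op


# Tiny in-memory stand-ins for feed.jsonl and the embeddings CSV.
feed = io.StringIO('{"put": "id:covid-19:doc::abc123", "fields": {"cord_uid": "abc123"}}\n')
vectors = io.StringIO("abc123,0.25,0.5,0.75\n")
for op in merge_embeddings(feed, vectors):
    print(json.dumps(op))
```

Mirroring the command above, the real inputs would be `open("feed.jsonl")` for the feed and `sys.stdin` for the embeddings.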
Use the vespa CLI to feed the data to your Vespa instance:
vespa feed -t <endpoint-url> merged-feed.jsonl
Indexing is CPU-intensive because both the abstract and the title are encoded using ColBERT.
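Since feeding and indexing take a while, it may be worth checking the feed file for malformed lines first. The helper below is a hypothetical pre-flight check, not part of the application or the vespa CLI; it only verifies that each non-empty line is a JSON object with `put` and `fields` keys.

```python
import json


def check_feed(lines):
    """Return (valid_count, problems) for an iterable of JSONL lines."""
    valid, problems = 0, []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            op = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(op, dict) or "put" not in op or "fields" not in op:
            problems.append(f"line {number}: missing 'put' or 'fields'")
            continue
        valid += 1
    return valid, problems


# Example with one good line and two bad ones.
count, issues = check_feed([
    '{"put": "id:covid-19:doc::abc123", "fields": {"title": "A title"}}',
    "not json",
    '{"fields": {}}',
])
print(count, issues)
```

For the real file, call `check_feed(open("merged-feed.jsonl"))` and inspect the returned list of problems before feeding.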