Hello!
Feature Request overview
- Many example scripts use `http_get`, while we can load that data more smoothly with `datasets`.
Details
Many example scripts and some tests rely on `http_get` to download data, e.g. https://sbert.net/datasets/stsbenchmark.tsv.gz, https://msmarco.z22.web.core.windows.net/msmarcoranking/collection.tar.gz, askubuntu, TREC, etc., while this data is often also easily accessible on Hugging Face. We should be able to simplify a lot of these scripts considerably with `datasets` (and perhaps also `Dataset.map`, `Dataset.filter`, etc.).
For example:
```python
sts_dataset_path = "datasets/stsbenchmark.tsv.gz"
if not os.path.exists(sts_dataset_path):
    util.http_get("https://sbert.net/datasets/stsbenchmark.tsv.gz", sts_dataset_path)
```
Instead, we can follow the steps I already took in 548e463 to update these:
```python
# 2. Load the STSB dataset: https://huggingface.co/datasets/sentence-transformers/stsb
train_dataset = load_dataset("sentence-transformers/stsb", split="train")
eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")
test_dataset = load_dataset("sentence-transformers/stsb", split="test")
```
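Where a script currently parses and cleans the downloaded file by hand, `Dataset.map` and `Dataset.filter` could take over that work. Below is a minimal sketch of the idea, not a definitive implementation. The column names (`sentence1`, `sentence2`, `score`), the raw 0-5 score scale, the toy rows, and all three function names are assumptions made for illustration. The datasets on the Hub may already be cleaned and normalized, in which case these steps are unnecessary.

```python
# Sketch: post-processing with Dataset.map / Dataset.filter instead of manual parsing.
# Assumptions (not from the issue): the column names, the 0-5 score scale,
# the toy rows, and every function name below are hypothetical.


def normalize_score(example: dict) -> dict:
    """Map callback: rescale a raw 0-5 similarity score to the 0-1 range."""
    return {"score": example["score"] / 5.0}


def has_both_sentences(example: dict) -> bool:
    """Filter callback: keep only rows where both sentences are non-empty."""
    return bool(example["sentence1"].strip()) and bool(example["sentence2"].strip())


def build_toy_dataset():
    """Build a tiny in-memory dataset and apply both callbacks."""
    # Imported here so the callbacks above stay usable without `datasets` installed.
    from datasets import Dataset

    raw = Dataset.from_dict(
        {
            "sentence1": ["A man is playing a guitar.", "", "A dog runs."],
            "sentence2": ["A person plays an instrument.", "Nothing to compare.", "A cat sleeps."],
            "score": [4.0, 2.5, 0.5],
        }
    )
    # Drop incomplete rows first, then overwrite the "score" column with rescaled values.
    return raw.filter(has_both_sentences).map(normalize_score)
```

Calling `build_toy_dataset()` requires the `datasets` package. In a real script the same two callbacks could be applied to the result of `load_dataset(...)` rather than to an in-memory toy dataset.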
Referenced code:
- sentence-transformers/tests/test_train_stsb.py, lines 35 to 37 in 5bd3e61 (the `http_get` snippet)
- sentence-transformers/examples/sentence_transformer/training/sts/training_stsbenchmark.py, lines 40 to 43 in 5bd3e61 (the `load_dataset` snippet)