Closed
Collaborator
Can we use this Python library to manage the datasets for FTS? It contains most IR datasets: https://github.com/allenai/ir_datasets/
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Denise2004. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing
Collaborator
Closing as a duplicate of #705.
This PR adds full-text search (FTS) support to VectorDBBench, so the benchmarking tool can evaluate Milvus's BM25 full-text search performance. The feature is built on the MS MARCO dataset and supports test cases at several scales.
Main Achievements
Adapted 3 FTS Performance Test Cases: Support for 100K, 5M, and 8.8M scale MS MARCO datasets. Only the 100K version is submitted in this PR.
Complete FTS Dataset Management: Support for reading, parsing, and batch processing of TSV format data files
Milvus FTS Client Integration: Implement full-text document insertion and BM25 search functionality.
FTS-Specific Evaluation Metrics: Added calculation of Recall@K, NDCG@K, MRR and other metrics
Frontend Interface Support: Added FTS test case configuration and parameter settings in the Web UI
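The TSV reading and batch processing mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `read_tsv_batches` and the default batch size are hypothetical, and it assumes a two-column `id<TAB>text` layout like the MS MARCO passage collection file.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def read_tsv_batches(
    lines: Iterable[str], batch_size: int = 1000
) -> Iterator[List[Tuple[str, str]]]:
    """Yield lists of (doc_id, text) pairs from a two-column TSV stream.

    `lines` can be an open file handle or any iterable of strings.
    Hypothetical helper for illustration only.
    """
    # QUOTE_NONE: passage text may contain stray quote characters,
    # so quotes are treated as ordinary data.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip malformed rows that do not have both columns.
    rows = ((row[0], row[1]) for row in reader if len(row) >= 2)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch
```

Streaming in batches keeps memory bounded for the larger 5M and 8.8M variants, since only one batch is held in memory at a time.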
Core Features
1. Dataset Support
`FtsDatasetManager` to manage MS MARCO datasets
2. Milvus FTS Integration
`insert_fulltext()` method: Support batch insertion of full-text documents
`search_fulltext()` method: Full-text search based on BM25 algorithm
3. Test Execution Engine
`SerialFtsInsertRunner`: FTS document insertion executor
`SerialSearchRunner` and `MultiProcessingSearchRunner`: Support FTS search testing
4. Evaluation System
`calc_recall_fts()` and `calc_ndcg_fts()` functions
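For reference, the per-query metrics named in this description (Recall@K, NDCG@K, MRR) are commonly computed as in the sketch below. It assumes binary relevance judgments, and the function names are hypothetical. The actual `calc_recall_fts()` and `calc_ndcg_fts()` in this PR may use different signatures or conventions.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mrr_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking places all relevant documents (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

A benchmark run would typically average these values over all queries. MS MARCO judgments are sparse (often a single relevant passage per query), which is one reason MRR is a common headline metric for that dataset.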