Add BM25F: field-aware BM25 scoring #54
Open
tavian-dev wants to merge 1 commit into dorianbrown:master from
Implements BM25F (BM25 with field weights) as requested in dorianbrown#11. BM25F combines term frequencies across document fields before applying saturation, which avoids the over-estimation that occurs when scoring each field independently and summing. This is the standard approach from Robertson, Zaragoza & Taylor (2004). API matches existing BM25 variants (get_scores, get_batch_scores, get_top_n) with additional field_weights and field_b parameters. Includes 10 tests covering scoring, ranking, batch scores, single-field equivalence with BM25Okapi, sparse fields, and tokenizer.
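To make the over-estimation concrete, here is a minimal numeric sketch. It assumes the usual BM25 saturation form `tf * (k1 + 1) / (tf + k1)` with `k1 = 1.5`, equal field weights, and no length normalization. The `saturate` helper is hypothetical and not part of this PR.

```python
def saturate(tf, k1=1.5):
    """BM25-style term-frequency saturation (illustrative helper, not from the PR)."""
    return tf * (k1 + 1) / (tf + k1)


# A query term that occurs once in the title and once in the body.
tf_title, tf_body = 1.0, 1.0

# Scoring each field independently and summing: saturation is applied per field,
# so every field can contribute up to (k1 + 1) on its own.
per_field_sum = saturate(tf_title) + saturate(tf_body)

# BM25F-style: combine the term frequencies first, then saturate once,
# so the total stays bounded by (k1 + 1) regardless of the number of fields.
combined = saturate(tf_title + tf_body)

print(f"per-field sum: {per_field_sum:.4f}, combined: {combined:.4f}")
```

With more fields the gap widens: the per-field sum can grow to the number of fields times `k1 + 1`, while the combined form never exceeds `k1 + 1`.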
Summary
Implements `BM25F` as requested in #11: field-aware BM25 that combines term frequencies across document fields before applying saturation. This avoids the over-estimation that occurs when scoring each field independently and summing, as described in Robertson, Zaragoza & Taylor (2004).

What's new
- `BM25F` class in `rank_bm25.py` with per-field boost weights (`field_weights`) and per-field length normalization (`field_b`)
- Combined term frequency: `tf_combined = Σ_f [ w_f * tf(t,d,f) / (1 - b_f + b_f * |d_f| / avgdl_f) ]`
- `get_scores()`, `get_batch_scores()`, `get_top_n()`, `tokenizer` support
- 10 tests covering scoring, ranking, single-field equivalence with `BM25Okapi`, batch scores, `get_top_n`, sparse fields, tokenizer, parameter defaults

Single-field equivalence
When used with a single field, `BM25F` produces scores identical to `BM25Okapi` (verified by test).

Usage
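The following is a standalone sketch of the combined term-frequency formula, not the PR's `BM25F` class or its API. The function name `bm25f_scores`, the dict-of-token-lists document layout, the `(k1 + 1)` factor, and the non-negative IDF variant are all assumptions made for illustration. In particular, `BM25Okapi` in `rank_bm25` uses a different IDF, so absolute scores from this sketch will not match it.

```python
import math
from collections import Counter


def bm25f_scores(query, docs, field_weights, field_b, k1=1.5):
    """Score every document against `query` with a BM25F-style formula.

    docs: list of dicts mapping field name -> list of tokens (assumed layout).
    field_weights / field_b: dicts mapping field name -> w_f / b_f.
    """
    n_docs = len(docs)
    fields = list(field_weights)

    # Average length of each field across the corpus (avgdl_f).
    avgdl = {f: sum(len(d.get(f, [])) for d in docs) / n_docs for f in fields}

    # Document frequency: number of documents containing the term in any field.
    df = Counter()
    for d in docs:
        df.update({t for f in fields for t in d.get(f, [])})

    scores = []
    for d in docs:
        counts = {f: Counter(d.get(f, [])) for f in fields}
        score = 0.0
        for t in query:
            if df[t] == 0:
                continue  # term never appears in the corpus
            # tf_combined = sum_f w_f * tf(t,d,f) / (1 - b_f + b_f * |d_f| / avgdl_f)
            tf_combined = 0.0
            for f in fields:
                if avgdl[f] == 0:
                    continue  # field is empty in every document
                norm = 1 - field_b[f] + field_b[f] * len(d.get(f, [])) / avgdl[f]
                tf_combined += field_weights[f] * counts[f][t] / norm
            # Non-negative IDF variant (an assumption, see the note above).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Saturation is applied once, after the fields are combined.
            score += idf * tf_combined * (k1 + 1) / (tf_combined + k1)
        scores.append(score)
    return scores


docs = [
    {"title": ["bm25", "ranking"], "body": ["okapi", "scoring", "function"]},
    {"title": ["cooking", "pasta"], "body": ["bm25", "is", "mentioned", "once"]},
    {"title": ["gardening"], "body": ["tomatoes", "and", "basil"]},
]
scores = bm25f_scores(
    ["bm25"],
    docs,
    field_weights={"title": 2.0, "body": 1.0},
    field_b={"title": 0.75, "body": 0.75},
)
print(scores)
```

With a single field of weight 1, the saturated term reduces to `tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl))`, which is the usual Okapi term-frequency component. That is the sense in which a single-field BM25F can coincide with plain BM25.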
Closes #11