
Add BM25F: field-aware BM25 scoring#54

Open
tavian-dev wants to merge 1 commit into dorianbrown:master from tavian-dev:feat/bm25f


Summary

Implements BM25F as requested in #11: a field-aware BM25 that combines weighted term frequencies across document fields before applying saturation. This avoids the over-estimation that occurs when each field is scored independently and the per-field scores are summed, as described in Robertson, Zaragoza & Taylor (2004).
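To make the over-estimation concrete, here is a small arithmetic sketch. It is not code from this PR: the `saturate` helper, the `k1 = 1.5` value, and the term counts are illustrative assumptions, and length normalization and field weights are left out to keep the comparison simple.

```python
def saturate(tf, k1=1.5):
    """BM25-style saturation: grows with tf but never reaches 1.0."""
    return tf / (k1 + tf)

# Hypothetical counts: the query term occurs 3 times in the title
# and 3 times in the body of the same document.
title_tf, body_tf = 3, 3

# Scoring each field independently and summing: saturation is applied twice,
# so the total can exceed the ceiling that a single saturation enforces.
summed = saturate(title_tf) + saturate(body_tf)

# BM25F-style: combine the term frequencies first, then saturate once.
combined = saturate(title_tf + body_tf)

print(f"summed per-field: {summed:.3f}")
print(f"combined first:   {combined:.3f}")
```

With these numbers the summed score goes above 1.0 while the combined score stays below it, which is the effect the combined-TF approach is meant to avoid.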

What's new

  • BM25F class in rank_bm25.py with per-field boost weights (field_weights) and per-field length normalization (field_b)
  • Unified IDF: a term's document frequency counts a document if the term appears in any field
  • Combined TF with single saturation: tf_combined = Σ_f [ w_f * tf(t,d,f) / (1 - b_f + b_f * |d_f| / avgdl_f) ]
  • Same API as existing variants: get_scores(), get_batch_scores(), get_top_n(), tokenizer support
  • Documents can have different fields; missing fields are treated as empty
  • 10 tests covering: scoring correctness, title-boost ranking, single-field equivalence with BM25Okapi, batch scores, get_top_n, sparse fields, tokenizer, parameter defaults
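The scoring steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation in this PR: the function name `bm25f_scores`, the default `k1 = 1.5`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are all choices made for the example, and the class in `rank_bm25.py` may differ on each of them.

```python
import math
from collections import Counter

def bm25f_scores(corpus, query, field_weights, field_b, k1=1.5):
    """Score every document (a dict of field -> token list) against query tokens.

    Illustrative sketch only; names and the IDF variant are assumptions.
    """
    fields = list(field_weights)
    n_docs = len(corpus)

    # Average length per field; a missing field counts as an empty token list.
    avgdl = {f: sum(len(doc.get(f, [])) for doc in corpus) / n_docs for f in fields}

    # Unified document frequency: a document counts once for a term
    # if the term appears in any of its fields.
    df = Counter()
    for doc in corpus:
        seen = set()
        for f in fields:
            seen.update(doc.get(f, []))
        df.update(seen)

    scores = []
    for doc in corpus:
        score = 0.0
        for term in query:
            if df[term] == 0:
                continue  # term is absent from the whole corpus
            tf_combined = 0.0
            for f in fields:
                tokens = doc.get(f, [])
                if not tokens:
                    continue  # missing or empty field contributes nothing
                # Per-field length normalization with its own b_f.
                norm = 1 - field_b[f] + field_b[f] * len(tokens) / avgdl[f]
                tf_combined += field_weights[f] * tokens.count(term) / norm
            # Assumed IDF variant (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Single saturation step over the combined term frequency.
            score += idf * tf_combined / (k1 + tf_combined)
        scores.append(score)
    return scores

corpus = [
    {"title": ["machine", "learning"], "body": ["intro", "to", "ML"]},
    {"title": ["deep", "networks"]},  # no "body" field: treated as empty
]
scores = bm25f_scores(
    corpus,
    ["machine", "learning"],
    field_weights={"title": 2.0, "body": 1.0},
    field_b={"title": 0.75, "body": 0.75},
)
print(scores)
```

The second document has no `body` field and matches neither query term, so it scores zero, while the first document gets a positive score from its boosted title matches.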

Single-field equivalence

When used with a single field, BM25F produces scores identical to BM25Okapi (verified by test).
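The reason the single-field case collapses to ordinary BM25 can be checked numerically. The sketch below compares only the term-frequency part of each formula; both helper functions are hypothetical names written for this example, and it assumes a field weight of 1.0. Exact score equality with `BM25Okapi` additionally requires the IDF definition and any constant factor, such as a `(k1 + 1)` numerator, to match, which this sketch does not cover.

```python
import math

def bm25_tf_part(tf, dl, avgdl, k1=1.5, b=0.75):
    """Classic BM25 term-frequency component, without IDF or a (k1 + 1) factor."""
    return tf / (tf + k1 * (1 - b + b * dl / avgdl))

def bm25f_tf_part_single_field(tf, dl, avgdl, k1=1.5, b=0.75, weight=1.0):
    """BM25F with one field: normalize tf by length first, then saturate once."""
    tf_norm = weight * tf / (1 - b + b * dl / avgdl)
    return tf_norm / (k1 + tf_norm)

# Compare the two forms over a few (tf, document length, average length) triples.
cases = [(1, 5, 10.0), (3, 10, 10.0), (7, 40, 12.5)]
all_equal = all(
    math.isclose(bm25_tf_part(tf, dl, avg), bm25f_tf_part_single_field(tf, dl, avg))
    for tf, dl, avg in cases
)
print(all_equal)
```

Dividing the numerator and denominator of the BM25F form by the same length-normalization term gives the classic BM25 form, so the two agree whenever the single field has weight 1.0.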

Usage

from rank_bm25 import BM25F

corpus = [
    {"title": ["machine", "learning"], "body": ["intro", "to", "ML"]},
    {"title": ["deep", "networks"], "body": ["image", "recognition"]},
]

bm25f = BM25F(corpus, field_weights={"title": 2.0, "body": 1.0})
scores = bm25f.get_scores(["machine", "learning"])

Closes #11

Implements BM25F (BM25 with field weights) as requested in dorianbrown#11.

BM25F combines term frequencies across document fields before applying
saturation, which avoids the over-estimation that occurs when scoring
each field independently and summing. This is the standard approach
from Robertson, Zaragoza & Taylor (2004).

API matches existing BM25 variants (get_scores, get_batch_scores,
get_top_n) with additional field_weights and field_b parameters.

Includes 10 tests covering scoring, ranking, batch scores,
single-field equivalence with BM25Okapi, sparse fields, and tokenizer.