Add BM25F: field-aware BM25 scoring by tavian-dev · Pull Request #54 · dorianbrown/rank_bm25

tavian-dev · 2026-04-02T19:28:45Z

Summary

Implements BM25F as requested in #11 — field-aware BM25 that combines term frequencies across document fields before applying saturation. This avoids the over-estimation that occurs when scoring each field independently and summing, as described in Robertson, Zaragoza & Taylor (2004).
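To make the over-estimation concrete, here is a minimal sketch with hypothetical term counts. It assumes the usual BM25 saturation with the (k1 + 1) factor and leaves out length normalization and IDF. The `saturate` helper and the counts are illustrative and not taken from this PR.

```python
def saturate(tf, k1=1.5):
    """BM25 term-frequency saturation (length normalization omitted)."""
    return tf * (k1 + 1) / (tf + k1)

# Hypothetical counts: the query term occurs once in the title, twice in the body.
tf_title, tf_body = 1.0, 2.0

# Scoring each field independently and summing applies saturation twice.
per_field_sum = saturate(tf_title) + saturate(tf_body)

# BM25F-style: combine the counts first, then saturate once.
combined = saturate(tf_title + tf_body)

print(f"per-field sum: {per_field_sum:.3f}")  # 2.429
print(f"combined:      {combined:.3f}")       # 1.667
```

With F fields the per-field sum can approach F * (k1 + 1), while the combined form stays below k1 + 1, so repeated matches across fields no longer bypass saturation.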

What's new

BM25F class in rank_bm25.py with per-field boost weights (field_weights) and per-field length normalization (field_b)
Unified IDF: a term's document frequency counts a document if the term appears in any field
Combined TF with single saturation: tf_combined = Σ_f [ w_f * tf(t,d,f) / (1 - b_f + b_f * |d_f| / avgdl_f) ]
Same API as existing variants: get_scores(), get_batch_scores(), get_top_n(), tokenizer support
Documents can have different fields — missing fields are treated as empty
10 tests covering: scoring correctness, title-boost ranking, single-field equivalence with BM25Okapi, batch scores, get_top_n, sparse fields, tokenizer, parameter defaults
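The scoring steps listed above can be sketched as a standalone function. This is not the code from this PR: the function name `bm25f_scores`, the defaults (k1 = 1.5, b = 0.75), the non-negative IDF variant, and the (k1 + 1) saturation factor are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25f_scores(corpus, query, field_weights, field_b=None, k1=1.5):
    """Score every document in `corpus` against `query` (a list of tokens).

    `corpus` is a list of dicts mapping field name -> list of tokens.
    Hypothetical helper, not the BM25F class added by the PR.
    """
    field_b = field_b or {}
    fields = list(field_weights)
    n_docs = len(corpus)

    # Average length of each field; a missing field counts as length 0.
    avgdl = {f: sum(len(d.get(f, [])) for d in corpus) / n_docs for f in fields}

    # Unified document frequency: a document counts once per term,
    # no matter how many of its fields contain that term.
    df = Counter()
    for doc in corpus:
        df.update(set().union(*(doc.get(f, []) for f in fields)))

    scores = []
    for doc in corpus:
        score = 0.0
        for term in query:
            if term not in df:
                continue
            # Weighted, length-normalized term frequency summed over fields.
            tf_combined = 0.0
            for f in fields:
                tokens = doc.get(f, [])
                if not tokens:
                    continue  # missing or empty field contributes nothing
                b = field_b.get(f, 0.75)
                norm = 1 - b + b * len(tokens) / avgdl[f]
                tf_combined += field_weights[f] * tokens.count(term) / norm
            # Assumed IDF variant that is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturation is applied once, to the combined frequency.
            score += idf * tf_combined * (k1 + 1) / (tf_combined + k1)
        scores.append(score)
    return scores

corpus = [
    {"title": ["machine", "learning"], "body": ["intro", "to", "ML"]},
    {"title": ["deep", "networks"], "body": ["image", "recognition"]},
]
scores = bm25f_scores(corpus, ["machine", "learning"], {"title": 2.0, "body": 1.0})
```

On this two-document corpus only the first document contains the query terms, so it receives a positive score and the second scores zero.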

Single-field equivalence

When used with a single field, BM25F produces scores identical to BM25Okapi (verified by test).
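A short derivation shows why this holds, assuming a single field with weight 1, the same k1 and b in both scorers, and saturation with the (k1 + 1) factor as in BM25Okapi. Here B stands for the length-normalization term; the symbols are introduced for this sketch only.

```latex
\begin{align*}
B &= 1 - b + b\,\frac{|d|}{\mathrm{avgdl}}, \qquad
\widetilde{tf} = \frac{tf}{B} \\[4pt]
\frac{\widetilde{tf}\,(k_1 + 1)}{\widetilde{tf} + k_1}
  &= \frac{\frac{tf}{B}\,(k_1 + 1)}{\frac{tf}{B} + k_1}
   = \frac{tf\,(k_1 + 1)}{tf + k_1 B}
\end{align*}
```

The right-hand side is the per-term BM25Okapi factor. With one field, the unified document frequency equals the ordinary one, so the IDF term matches as well, provided both scorers use the same IDF formula.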

Usage

from rank_bm25 import BM25F

corpus = [
    {"title": ["machine", "learning"], "body": ["intro", "to", "ML"]},
    {"title": ["deep", "networks"], "body": ["image", "recognition"]},
]

bm25f = BM25F(corpus, field_weights={"title": 2.0, "body": 1.0})
scores = bm25f.get_scores(["machine", "learning"])

Closes #11

tavian-dev wants to merge 1 commit into dorianbrown:master from tavian-dev:feat/bm25f
