Debug BM25Okapi #26

Open
LowinLi wants to merge 1 commit into dorianbrown:master from LowinLi:master

LowinLi commented Aug 4, 2022

In the `_calc_idf` method of `BM25Okapi`, if `average_idf` is negative, `eps` is also negative, so the BM25 score can come out negative as well. This commit fixes that.

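A minimal sketch of the behaviour described above, not the library's actual code. It assumes `eps` is computed as `epsilon * average_idf` and that every negative IDF is replaced by `eps`. The function name `calc_idf_sketch`, the `epsilon=0.25` default, and the toy document frequencies are assumptions made for illustration.

```python
import math


def calc_idf_sketch(doc_freqs, corpus_size, epsilon=0.25):
    """Hypothetical stand-in for the IDF step discussed in this PR."""
    idf = {}
    negative_terms = []
    for term, freq in doc_freqs.items():
        # Unsmoothed IDF: negative once a term is in more than half the corpus.
        value = math.log(corpus_size - freq + 0.5) - math.log(freq + 0.5)
        idf[term] = value
        if value < 0:
            negative_terms.append(term)

    average_idf = sum(idf.values()) / len(idf)
    # Assumed floor: a fraction of the average IDF.
    eps = epsilon * average_idf
    for term in negative_terms:
        idf[term] = eps
    return idf, average_idf, eps


# Tiny corpus of 3 documents in which every term is very common.
doc_freqs = {"the": 3, "cat": 2, "dog": 2}
idf, average_idf, eps = calc_idf_sketch(doc_freqs, corpus_size=3)
print(average_idf, eps, idf)
```

With these inputs every raw IDF is negative, so the average is negative and the "floor" `eps` is negative too, which is the situation the PR description refers to.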
@dorianbrown dorianbrown self-requested a review May 28, 2024 12:19
Owner

dorianbrown commented May 28, 2024

I think I finally found where this motivation came from, namely this section from here:


Please note that the IDF formula listed above has a drawback when using it for terms appearing in more than half of the corpus since the value would come out as negative value, resulting in the overall score to become negative. e.g. if we have 10 documents in the corpus, and the term "the" appeared in 6 of them, its IDF would be $\log\frac{10 - 6 + 0.5}{6 + 0.5} = \log\frac{4.5}{6.5}$.

Although we can argue that our implementation should have already removed these frequently appearing words as these words are mostly used to form a complete sentence and carry little meaning of note, different softwares/packages still make different adjustments to prevent a negative score from ever occurring. e.g.

  • Add a 1 to the equation. $\mathrm{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$
  • For terms that result in a negative IDF value, swap it with a small positive value, usually denoted as epsilon
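The quoted example and the two adjustments can be checked with a few lines of Python. This is only an illustrative sketch: the function names and the `floor=0.25` value are made up here and are not taken from any particular package.

```python
import math


def idf_plain(n_docs, doc_freq):
    # Unadjusted formula from the quoted passage.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_plus_one(n_docs, doc_freq):
    # First adjustment: add 1 inside the logarithm.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_floored(n_docs, doc_freq, floor=0.25):
    # Second adjustment: replace a negative IDF with a small positive value.
    # The floor value is an arbitrary assumption for this sketch.
    value = idf_plain(n_docs, doc_freq)
    return value if value >= 0 else floor


# "the" appears in 6 of 10 documents, as in the quoted example.
print(idf_plain(10, 6))     # negative, equal to log(4.5 / 6.5)
print(idf_plus_one(10, 6))  # positive
print(idf_floored(10, 6))   # the chosen floor
```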

@dorianbrown
Owner

I wonder if it might be simpler to just go with the "smoothed" IDF function $\mathrm{IDF}(q_i) = \log\left(1 + \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5}\right)$, which ensures that IDFs are always positive. That way we don't have to do all this checking for negative values.

What do you think?
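As a quick sanity check of the "always positive" claim: for any document frequency between 1 and N the fraction inside the logarithm is positive, so the argument is greater than 1 and the logarithm is greater than 0. The sketch below verifies this numerically for a small corpus; `smoothed_idf` and the corpus size of 10 are illustrative assumptions, not code from this repository.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    # log(1 + (N - n + 0.5) / (n + 0.5)); the argument is always > 1.
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


n_docs = 10
# Every possible document frequency, from a term in 1 document to a term in all.
values = [smoothed_idf(n_docs, n) for n in range(1, n_docs + 1)]

all_positive = all(v > 0 for v in values)
# Rarer terms should still score higher than common ones.
strictly_decreasing = all(a > b for a, b in zip(values, values[1:]))
print(all_positive, strictly_decreasing)
```

Even a term that appears in every document keeps a small positive weight under this variant, so no epsilon floor is needed.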
