Suggestion to add BM25 Score #57

Open
OmaymaS opened this issue Apr 23, 2017 · 8 comments
Labels
feature a feature request or enhancement

Comments

@OmaymaS

OmaymaS commented Apr 23, 2017

I suggest adding a function to bind the BM25 score (which is based on a probabilistic term-weighting model). It is useful in some cases because it gives control over:

  • Term frequency saturation
  • Document/Field length normalization

It is commonly used as a ranking function by search engines.

I implemented a function bind_bm25 in the forked repo HERE

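# bind_bm25 takes bare column names -------------------

bind_bm25 <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  bind_bm25_(tbl,
             col_name(substitute(term_col)),
             col_name(substitute(document_col)),
             col_name(substitute(n_col)),
             k = k,
             b = b)
}

# bind_bm25_ takes column names as strings -------------------------

bind_bm25_ <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  # as.character() so the lookups below index by name, not by position
  # (matters when terms or documents are factors or integer IDs)
  terms <- as.character(tbl[[term_col]])
  documents <- as.character(tbl[[document_col]])
  n <- tbl[[n_col]]

  # document lengths (total term count per document) and their mean
  doc_totals <- tapply(n, documents, sum)
  avg_dl <- mean(doc_totals)

  # idf: log(number of documents / number of rows containing the term)
  idf <- log(length(doc_totals) / table(terms))

  # each row's document length relative to the average document length
  norm_dl <- as.numeric(doc_totals[documents]) / avg_dl

  # saturating term frequency: approaches k + 1 as n grows
  tbl$tf_bm25 <- ((k + 1) * n) / (n + k * ((1 - b) + b * norm_dl))
  tbl$idf <- as.numeric(idf[terms])
  tbl$bm25 <- tbl$tf_bm25 * tbl$idf

  tbl
}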
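
For readers who prefer the formula to the code, here is a sketch of what the three added columns compute, read directly off the function above. The symbols are my own shorthand and not from the original post: n_{t,d} is the count of term t in document d (the n column), |d| is the total count for document d, avgdl is the mean of those totals, N is the number of documents, and df_t is the number of rows containing t (this equals the number of documents containing t only if each document/term pair appears once, which is an assumption about the input).

```latex
\begin{aligned}
\mathrm{tf\_bm25}(t,d) &= \frac{(k+1)\,n_{t,d}}
  {n_{t,d} + k\left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)} \\[6pt]
\mathrm{idf}(t) &= \ln\frac{N}{\mathrm{df}_t} \\[6pt]
\mathrm{bm25}(t,d) &= \mathrm{tf\_bm25}(t,d)\cdot\mathrm{idf}(t)
\end{aligned}
```

Two things follow from this form. As n_{t,d} grows, tf_bm25 approaches k + 1, which is the term frequency saturation controlled by k. With b = 1 (the default here) document length is fully normalized, while b = 0 turns length normalization off. Note that the idf is the plain log ratio, the same one bind_tf_idf uses, rather than the smoothed variant often quoted for Okapi BM25.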

@OmaymaS OmaymaS changed the title from "Request: Suggestion to add BM25 Score" to "Suggestion to add BM25 Score" Apr 23, 2017
@Ironholds

This seems super useful! I might suggest adding substitution for lazy evaluation (so it matches the rest of the code) and experimenting around with S3 methods in case this falls over for data.tables, but I'm happy to do that work and fully integrate it if @juliasilge and/or @dgrtwo give a thumbs up to the general ticket scope?

@OmaymaS
Author

OmaymaS commented May 1, 2017

Thanks!
I just want to understand what issues could appear with data.table.
I think it will work properly, like bind_tf_idf, or did you mean something else?

@Ironholds

Oh, just the indices-based selection can sometimes get gnarly since it behaves somewhat differently. It'll probably be fine, but I'll check to make sure once David/Julia sign off (hinthint)

@juliasilge
Owner

We are working on getting broken things fixed, cleaned up, etc for our 0.1.3 release, but let's come back and get this implemented for tidytext 0.1.4!

@juliasilge juliasilge added this to the v0.1.4 milestone May 2, 2017
@Ironholds

If that's the goal, I'll add it to the to-do! Anything I can do to help with the fixing, cleanup, etc?

@juliasilge juliasilge modified the milestones: v0.1.4, v0.1.5 Sep 29, 2017
@juliasilge juliasilge removed this from the v0.1.5 milestone Apr 3, 2020
@juliasilge juliasilge added the feature a feature request or enhancement label Apr 3, 2020
@jl5000

jl5000 commented Jun 24, 2021

Is there any update on this? It would be good to have a TF-IDF alternative.

@juliasilge
Owner

No recent work on this, but if you are looking for an alternative to tf-idf that may fit your needs better, check out weighted log odds with the tidylo package.

@jl5000

jl5000 commented Jun 24, 2021

That's very helpful, thank you.
