
Speed improvement for bind_tf_idf #237

Open · sometimesabird opened this issue on Jun 26, 2023 · 5 comments
Labels: feature (a feature request or enhancement)

@sometimesabird

Hey, I noticed that bind_tf_idf() doesn't really use dplyr, which performs better than base R. I saw a 30% speed improvement when computing tf-idf for a corpus of 100,000 tweets using this code:

corpus %>%
  group_by(TextID, word) %>%
  count() %>%                    # one row per document-term pair
  group_by(TextID) %>%
  mutate(tf = n / sum(n)) %>%    # term frequency within each document
  group_by(word) %>%
  mutate(Documents = n()) %>%    # number of documents containing each term
  ungroup() %>%
  mutate(idf = log(length(unique(TextID)) / Documents),  # inverse document frequency
         tf_idf = tf * idf)
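For reference, the quantities computed here are tf(t, d) = n(t, d) / sum over t' of n(t', d) and idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The equivalent computation with the current API should be something like this (assuming corpus holds one row per token, with TextID and word columns):

corpus %>%
  count(TextID, word) %>%
  bind_tf_idf(word, TextID, n)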
@juliasilge (Owner)

There have been some really big improvements in vctrs and dplyr since this code was originally written, so it would be a great idea for us to update it. 👍

@juliasilge added the feature label on Jul 2, 2023
@juliasilge (Owner) commented Jul 3, 2023

I started working on this today, but I noticed that using dplyr more directly is slower in the cases I have tested:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

book_words <- austen_books() |> 
  unnest_tokens(word, text) |> 
  count(book, word, sort = TRUE)

bench::mark(
  current_tidytext = bind_tf_idf(book_words, word, book, n),
  use_dplyr = book_words |> 
    mutate(tf = n / sum(n), .by = "book") %>% 
    mutate(doc_total = n(), .by = "word") %>% 
    mutate(idf = log(n_distinct(book) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current_tidytext   28.6ms   29.1ms      33.8    9.13MB     7.25
#> 2 use_dplyr          46.3ms   46.7ms      21.4    6.24MB    25.7

Created on 2023-07-03 with reprex v2.0.2

Let me find a convenient dataset with a lot more short texts to compare.

@juliasilge (Owner)

Hmmm, it still looks faster to keep it as is, even with shorter and more numerous documents:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidytext)

word_counts <- modeldata::tate_text |> 
  unnest_tokens(word, title) |> 
  count(id, word, sort = TRUE)

bench::mark(
  current_tidytext = bind_tf_idf(word_counts, word, id, n),
  use_dplyr = word_counts |> 
    mutate(tf = n / sum(n), .by = "id") %>% 
    mutate(doc_total = n(), .by = "word") %>% 
    mutate(idf = log(n_distinct(id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current_tidytext   19.8ms   20.3ms      49.2    4.27MB     6.71
#> 2 use_dplyr          22.6ms   22.7ms      44.0    2.83MB   249.

Created on 2023-07-03 with reprex v2.0.2

@sometimesabird can you show me an example where this would be faster?

@etiennebacher

Hi, I came across this issue a bit randomly but thought I'd give it a try. I used the text preparation steps described here to get a large enough word count, and I added the package collapse to the benchmark. It's a fast, dependency-free package designed to work well with dplyr syntax, so I'm just putting it here in case you want to consider it:

suppressPackageStartupMessages({
  library(dplyr)
  library(collapse)
  library(sotu)
  library(readtext)
  library(tidytext)
})

file_paths <- sotu_dir()
sotu_texts <- readtext(file_paths)

sotu_whole <- 
  sotu_meta %>%  
  arrange(president) %>% # sort metadata
  bind_cols(sotu_texts) %>% # combine with texts
  as_tibble()

tidy_sotu <- sotu_whole %>%
  unnest_tokens(word, text) |> 
  fcount(doc_id, word, sort = TRUE, name = "n")


bench::mark(
  current_tidytext = bind_tf_idf(tidy_sotu, word, doc_id, n),
  
  use_collapse = tidy_sotu |> 
    fgroup_by(doc_id) |> 
    fmutate(tf = n / sum(n)) %>% 
    fungroup() |> 
    fcount(word, name = "doc_total", add = TRUE) |> 
    fmutate(idf = log(n_distinct(doc_id) / doc_total),
            tf_idf = tf * idf) |>
    fselect(-doc_total),
  
  use_dplyr = tidy_sotu |> 
    mutate(tf = n / sum(n), .by = "doc_id") %>% 
    mutate(doc_total = n(), .by = "word") %>% 
    mutate(idf = log(n_distinct(doc_id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current_tidytext  415.5ms  421.6ms      2.37    73.5MB     2.37
#> 2 use_collapse       29.2ms   35.8ms     22.7     27.5MB    11.3 
#> 3 use_dplyr         331.5ms  351.5ms      2.84    46.4MB     2.84

@juliasilge (Owner)

Thanks @etiennebacher! I also should try out using vctrs directly for comparison.
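In case it's useful, here is a rough sketch of what a vctrs-based version could look like (untested, just to outline the approach; the helper name tf_idf_vctrs and the exact grouping calls are my own, not anything already in tidytext):

library(vctrs)

# Hypothetical helper: expects one row per document-term pair, as
# produced by count(), with the raw counts in column `n`.
tf_idf_vctrs <- function(tbl, term, document, n) {
  terms  <- tbl[[term]]
  docs   <- tbl[[document]]
  counts <- tbl[[n]]

  # vec_group_id() assigns integer ids in order of first appearance,
  # matching the key order returned by vec_group_loc()
  doc_id <- vec_group_id(docs)
  doc_totals <- vapply(vec_group_loc(docs)$loc,
                       function(i) sum(counts[i]), numeric(1))
  tf <- counts / doc_totals[doc_id]        # term frequency within each document

  term_id  <- vec_group_id(terms)
  doc_freq <- tabulate(term_id, nbins = attr(term_id, "n"))  # documents per term
  idf <- log(attr(doc_id, "n") / doc_freq[term_id])

  tbl$tf     <- tf
  tbl$idf    <- idf
  tbl$tf_idf <- tf * idf
  tbl
}

# e.g. tf_idf_vctrs(tidy_sotu, "word", "doc_id", "n")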
