
Tuning number of topics in LDA K #171

Open
qiushiyan opened this issue May 1, 2020 · 2 comments
Labels: feature (a feature request or enhancement)

Comments

@qiushiyan

Hi Julia! I'm a big fan of the Tidy Text Mining book, but it doesn't seem to place much emphasis on how to tune the number of topics (K) in an LDA model, or on comparing LDA models with different values of K. I find the ldatuning package quite helpful. Would you be interested in implementing a wrapper or a similar function in the tidytext package?
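For reference, ldatuning's workflow looks roughly like this (a minimal sketch; dtm stands in for a hypothetical DocumentTermMatrix, and the metric names are those listed in the package's documentation):

library(ldatuning)

# Fit LDA models over a grid of K and score each with several
# published metrics for choosing the number of topics.
# dtm is a hypothetical DocumentTermMatrix of the corpus.
result <- FindTopicsNumber(
  dtm,
  topics  = seq(2, 20, by = 2),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 123)
)

# Plot the metrics against K to eyeball a reasonable choice
FindTopicsNumber_plot(result)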

@juliasilge (Owner)

Hello @Enixam! 🙌

I have been moving away from the topicmodels package in favor of the stm package for topic modeling, for a variety of reasons (speed, ease of use, document-level covariates, etc.), so I'd be more interested in pursuing options in that direction. In 2018 I published this blog post showing how to set up training many models at different values of K, similar to stm's own searchK() function but allowing for more detailed exploration of results. It already uses functions from tidytext (the stm tidiers and such).
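A minimal sketch of that many-models setup (my_dfm is a placeholder for a quanteda document-feature matrix; the blog post parallelizes the fitting, but plain purrr::map() works the same way):

library(stm)
library(dplyr)
library(purrr)

# Fit one stm topic model per candidate value of K.
# my_dfm is a hypothetical quanteda dfm of the documents.
many_models <- tibble(K = c(3, 4, 6, 8, 10)) %>%
  mutate(topic_model = map(K, ~ stm(my_dfm, K = .x,
                                    verbose = FALSE, seed = 123)))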


You can also see how I covered this material at rstudio::conf in January.

So this is possible already, but it does require folks to use purrr::map() and friends directly, along with the functions that calculate metrics such as semantic coherence. There are benefits to that (people get a chance to know what they're dealing with), but perhaps there would be upside to creating something with a lower barrier to getting started. It would work more directly like stm::searchK(), I guess, but return a tibble something like the one below (a sketch of computing these metrics follows the output):

## # A tibble: 5 x 10
##       K topic_model exclusivity semantic_cohere… eval_heldout residual   bound
##   <dbl> <list>      <list>      <list>           <list>       <list>     <dbl>
## 1     3 <STM>       <dbl [3]>   <dbl [3]>        <named list… <named … -1.73e6
## 2     4 <STM>       <dbl [4]>   <dbl [4]>        <named list… <named … -1.70e6
## 3     6 <STM>       <dbl [6]>   <dbl [6]>        <named list… <named … -1.69e6
## 4     8 <STM>       <dbl [8]>   <dbl [8]>        <named list… <named … -1.67e6
## 5    10 <STM>       <dbl [10]>  <dbl [10]>       <named list… <named … -1.66e6
## # … with 3 more variables: lfact <dbl>, lbound <dbl>, iterations <dbl>
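A sketch of how those metric columns could be filled in, following the blog post's approach (many_models and my_dfm are the hypothetical objects from the snippet above):

library(stm)
library(dplyr)
library(purrr)

# Hold out part of the dfm for heldout-likelihood evaluation,
# then attach diagnostics to each fitted model.
heldout <- make.heldout(my_dfm)

k_result <- many_models %>%
  mutate(
    exclusivity        = map(topic_model, exclusivity),
    semantic_coherence = map(topic_model, semanticCoherence, my_dfm),
    eval_heldout       = map(topic_model, eval.heldout, heldout$missing),
    residual           = map(topic_model, checkResiduals, my_dfm),
    bound              = map_dbl(topic_model, ~ max(.x$convergence$bound)),
    lfact              = map_dbl(topic_model, ~ lfactorial(.x$settings$dim$K)),
    lbound             = bound + lfact,
    iterations         = map_dbl(topic_model, ~ length(.x$convergence$bound))
  )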

So to sum up,

  • I recommend using stm over topicmodels, and you can check out the links here for how to find the best value of K using tidy data principles
  • we can think about adding support for folks to train many models at once (at many values of K) more directly

@juliasilge added the feature label on May 1, 2020
@qiushiyan (Author)

Thanks Julia, as always your posts have helped me a lot!
