
Tuning number of topics in LDA K #171

Open
qiushiyan opened this issue May 1, 2020 · 2 comments
Labels: feature (a feature request or enhancement)

Comments

@qiushiyan

Hi Julia! I'm a big fan of the Tidy Text Mining book, but it doesn't seem to place much emphasis on how to tune the number of topics (K) in an LDA model, or on comparing LDA models with different values of K. I find the ldatuning package quite helpful. Would you be interested in implementing a wrapper or a similar function in the tidytext package?
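For reference, ldatuning's workflow looks roughly like this (a minimal sketch; dtm stands in for a hypothetical DocumentTermMatrix, and the metric names are those listed in the package's documentation):

library(ldatuning)

# Fit LDA models over a grid of K and score each with several
# published metrics for choosing the number of topics.
# dtm is a hypothetical DocumentTermMatrix of the corpus.
result <- FindTopicsNumber(
  dtm,
  topics  = seq(2, 20, by = 2),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 123)
)

# Plot the metrics against K to eyeball a reasonable choice
FindTopicsNumber_plot(result)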

@juliasilge (Owner)

Hello @Enixam! 🙌

I have been moving away from the topicmodels package in favor of the stm package for topic modeling, for a variety of reasons (speed, ease of use, document-level covariates, etc.), so I'd be more interested in pursuing options in that direction. In 2018 I published this blog post showing how to set up training many models at different values of K, similar to stm's own searchK() function but allowing for more detailed exploration of results. It already uses functions from tidytext (the stm tidiers and such).
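A minimal sketch of that many-models setup (my_dfm is a placeholder for a quanteda document-feature matrix; the blog post parallelizes the fitting, but plain purrr::map() works the same way):

library(stm)
library(dplyr)
library(purrr)

# Fit one stm topic model per candidate value of K.
# my_dfm is a hypothetical quanteda dfm of the documents.
many_models <- tibble(K = c(3, 4, 6, 8, 10)) %>%
  mutate(topic_model = map(K, ~ stm(my_dfm, K = .x,
                                    verbose = FALSE, seed = 123)))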


You can also see how I covered this material at rstudio::conf in January.

So this is possible already, but it does require folks to use purrr::map() and friends directly, along with the functions that calculate metrics such as semantic coherence. There are benefits to that (people get a chance to know what they're dealing with), but perhaps there would be upside to creating something with a lower barrier to getting started. It would work more directly like stm::searchK(), I guess, but return a tibble something like the one below (a sketch of computing these metrics follows the output):

## # A tibble: 5 x 10
##       K topic_model exclusivity semantic_cohere… eval_heldout residual   bound
##   <dbl> <list>      <list>      <list>           <list>       <list>     <dbl>
## 1     3 <STM>       <dbl [3]>   <dbl [3]>        <named list… <named … -1.73e6
## 2     4 <STM>       <dbl [4]>   <dbl [4]>        <named list… <named … -1.70e6
## 3     6 <STM>       <dbl [6]>   <dbl [6]>        <named list… <named … -1.69e6
## 4     8 <STM>       <dbl [8]>   <dbl [8]>        <named list… <named … -1.67e6
## 5    10 <STM>       <dbl [10]>  <dbl [10]>       <named list… <named … -1.66e6
## # … with 3 more variables: lfact <dbl>, lbound <dbl>, iterations <dbl>
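A sketch of how those metric columns could be filled in, following the blog post's approach (many_models and my_dfm are the hypothetical objects from the snippet above):

library(stm)
library(dplyr)
library(purrr)

# Hold out part of the dfm for heldout-likelihood evaluation,
# then attach diagnostics to each fitted model.
heldout <- make.heldout(my_dfm)

k_result <- many_models %>%
  mutate(
    exclusivity        = map(topic_model, exclusivity),
    semantic_coherence = map(topic_model, semanticCoherence, my_dfm),
    eval_heldout       = map(topic_model, eval.heldout, heldout$missing),
    residual           = map(topic_model, checkResiduals, my_dfm),
    bound              = map_dbl(topic_model, ~ max(.x$convergence$bound)),
    lfact              = map_dbl(topic_model, ~ lfactorial(.x$settings$dim$K)),
    lbound             = bound + lfact,
    iterations         = map_dbl(topic_model, ~ length(.x$convergence$bound))
  )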

So to sum up,

  • I recommend using stm over topicmodels, and you can check out the links here for how to find the best value of K using tidy data principles
  • we can think about adding support for folks to train many models at once (at many values of K) more directly

@juliasilge added the feature label on May 1, 2020
@qiushiyan (Author)

Thanks Julia, as always your posts have helped me a lot!
