
[question] Topic number selection using Cross Validation #332

Open
j-aryan opened this issue Dec 10, 2021 · 1 comment

Comments

@j-aryan

j-aryan commented Dec 10, 2021

Hi,

I'm quite new to topic modelling and I've been working on a project with a very large corpus. Performing LDA with a Gibbs sampler is out of the question (at least for cross-validation, due to computational constraints), so Warp-LDA is the only viable option.

I've been trying to select the number of topics (k) using various measures. I tried perplexity first, but it just keeps decreasing as k increases and I couldn't identify a clear cut-off or elbow. I then tried coherence measures, scaled them, and plotted them against each other. Can anyone help me understand what exactly these measures are telling us? Is there any particular k that seems of interest?

[Screenshot: scaled coherence metrics plotted against the number of topics k]

Also, any guidance on how I should approach this would be fantastic. Below are the values I used for the other model parameters:
doc_topic_prior = 0.1
topic_word_prior = 0.01
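
For reference, this is roughly the sweep I am running (a minimal sketch; `dtm` stands in for my actual document-term matrix, and the text2vec calls are as I understand them):

```r
library(text2vec)

# `dtm` stands in for the document-term matrix (a sparse dgCMatrix).
# Document-level boolean co-occurrences for the coherence scores:
tcm <- Matrix::crossprod(sign(dtm))

ks <- seq(10, 200, by = 10)

results <- do.call(rbind, lapply(ks, function(k) {
  lda <- LDA$new(n_topics = k,
                 doc_topic_prior = 0.1,
                 topic_word_prior = 0.01)
  doc_topic <- lda$fit_transform(dtm, n_iter = 1000,
                                 convergence_tol = 1e-3,
                                 progressbar = FALSE)
  top_terms <- lda$get_top_words(n = 10, lambda = 1)
  data.frame(
    k          = k,
    perplexity = perplexity(dtm, lda$topic_word_distribution, doc_topic),
    # mean NPMI coherence over all topics; n_doc_tcm is needed because the
    # PMI-based metrics require the number of documents behind `tcm`
    coherence  = mean(coherence(top_terms, tcm,
                                metrics = "mean_npmi",
                                n_doc_tcm = nrow(dtm))[, 1])
  )
}))
```

(I score perplexity on the training set here for simplicity; for held-out perplexity one would fit on a training split and score the output of lda$transform() on the held-out documents instead.)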

@manuelbickel
Contributor

Thank you for your question. As with all tasks involving the selection of the right number of clusters, topics, etc., there is no single correct answer. Each selection criterion has its own logic, and you need to think about whether that logic fits the perspective you want to present in your results. For example, the different metrics are usually computed over different text windows when checking coherence, so if you target coherence within larger text windows, pick the corresponding metrics as your key selection criterion.
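
To make the text-window point concrete, here is a rough sketch of how the window choice enters via the term co-occurrence matrix (assuming text2vec; `tokens`, `vectorizer`, and `top_terms` are placeholders for your own objects):

```r
library(text2vec)

# `tokens`: list of tokenized documents; `vectorizer` built on the same
# vocabulary as the LDA model; `top_terms` from lda$get_top_words().
# A small window measures tight, local co-occurrence; a large window a
# looser, more "topical" association.
tcm_narrow <- create_tcm(itoken(tokens), vectorizer,
                         skip_grams_window = 5,
                         weights = rep(1, 5))   # unweighted counts
tcm_wide   <- create_tcm(itoken(tokens), vectorizer,
                         skip_grams_window = 50,
                         weights = rep(1, 50))

# the logratio-based metric needs only the raw co-occurrence counts
coherence(top_terms, tcm_narrow, metrics = "mean_logratio")
coherence(top_terms, tcm_wide,   metrics = "mean_logratio")
```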

Unfortunately, there is (to my knowledge) only limited practical experience in using coherence metrics to select a suitable model, and especially in what implications parameter variations have for the coherence metrics and the resulting interpretations. By practical I mean finding a model that makes sense from the perspective of qualitative interpretation, not just computational accuracy.

  • So, in practice, I would simply use the coherence metrics as indicators that point you to potentially interesting models with good performance, indicated by peaks.
  • Take the interesting models and check the top 10 or 20 terms of selected topics that fall within your area of expertise, so you can judge whether these topics make sense. Try to compare thematically similar topics across different models to understand what you might gain by increasing the granularity (i.e., increasing the number of topics); see the sketch after this list.
  • So in your case you might, e.g., check the models with k = 110 / 160 / 190 (or maybe 200, but since one metric decreases there, 190 might be favored).
  • Not to advertise my own work, but as an applied example you might have a look at this article, where the situation for selecting a good model was similarly ambiguous, which is why a qualitative check of the models was performed: https://energsustainsoc.biomedcentral.com/articles/10.1186/s13705-019-0226-z/figures/2
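
A minimal sketch of that qualitative check (the list `models` and the lambda value are just illustrative, assuming you keep the fitted text2vec models for the candidate k values):

```r
# `models`: a list of fitted LDA models keyed by their number of topics
for (k in c("110", "160", "190")) {
  cat("---- k =", k, "----\n")
  # lambda < 1 down-weights globally frequent terms (relevance as in LDAvis),
  # which often makes topics easier to judge qualitatively
  print(models[[k]]$get_top_words(n = 20, lambda = 0.6))
}
```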

I hope this helps. Please do not hesitate to ask further questions.
