Implement/support/explain topic modelling #42

Open
jbesomi opened this issue Jul 8, 2020 · 4 comments · May be fixed by #163
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed

Comments

@jbesomi
Owner

jbesomi commented Jul 8, 2020

Goal
Implement topic modeling in Texthero.

Topic modeling
There are two main approaches to topic modeling: LSA/LSI (latent semantic analysis/indexing) and LDA (latent Dirichlet allocation). This simple tutorial explains how to implement topic modeling in Python.

Python implementation
LSA/LSI is essentially TF-IDF followed by SVD. What matters is understanding how to visualize the topic model and how to return its information from the function.
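
To make the "TF-IDF + SVD" description concrete, here is a minimal sketch using scikit-learn. It is not Texthero's API: the function name `lsa_topics`, its parameters, and the toy documents are hypothetical, and the return value (a document-topic matrix plus the top terms per topic) is only one possible answer to the "what should the function return" question.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]


def lsa_topics(texts, n_topics=2, n_top_terms=3):
    """Hypothetical helper: LSA as TF-IDF followed by truncated SVD."""
    # Step 1: TF-IDF document-term matrix (sparse).
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)

    # Step 2: truncated SVD projects documents onto n_topics latent dimensions.
    svd = TruncatedSVD(n_components=n_topics, random_state=0)
    doc_topic = svd.fit_transform(tfidf)

    # Map column indices back to terms, ordered by column index.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

    # For each topic, keep the terms with the largest weights.
    top_terms = [
        [terms[i] for i in component.argsort()[::-1][:n_top_terms]]
        for component in svd.components_
    ]
    return doc_topic, top_terms


doc_topic, top_terms = lsa_topics(docs)
print(doc_topic.shape)
for topic_id, words in enumerate(top_terms):
    print(topic_id, words)
```

Returning both the document-topic matrix and the per-topic terms keeps the visualization question separate from the model itself; whether Texthero should return a Series, a DataFrame, or something else is still open.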

Documentation
Beyond adding the docstring, it would probably be useful to write a "getting started" tutorial on how topic modeling works and how to use Texthero's function.

We will probably want to implement both LSI and LDA, perhaps in two separate functions.

This issue is a work in progress. Any help is much appreciated!

@jbesomi jbesomi added documentation Improvements or additions to documentation help wanted Extra attention is needed enhancement New feature or request labels Jul 8, 2020
@Devilmoon

I'm not sure whether topic modeling has already been implemented in TextHero; if it hasn't, you might be interested in leveraging Gensim.
I've used it in the past as a novice in topic modeling and it's relatively simple to use. If I remember correctly, it also supports visualizing the results, which seems to be the core of this issue.

Hope this helps!

@jbesomi
Owner Author

jbesomi commented Aug 12, 2020

Hey Luca,

No, topic modeling hasn't been implemented in Texthero (with a lowercase h) yet. Gensim is an alternative, but we might not need it if we implement LSA, since that is more or less the same as calling pca, right?

And yes, visualizing and understanding the models is certainly an important aspect, but it's not the core of the issue. The core of the issue is to work out how to implement topic modeling correctly: which algorithm to pick, whether Gensim is strictly necessary, what the function signature and output should be, and so on.
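
On the question of whether LSA is the same as calling pca: a rough sketch of the relationship, using scikit-learn directly rather than Texthero's own function. PCA is an SVD of the mean-centered data, while the truncated SVD used for LSA skips the centering so that sparse TF-IDF matrices stay sparse. The random matrix below is a made-up stand-in for real features.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Made-up dense data standing in for a small feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

# PCA centers the data internally before taking the SVD.
pca_proj = PCA(n_components=2, svd_solver="full").fit_transform(X)

# TruncatedSVD does not center, so we center by hand to compare.
X_centered = X - X.mean(axis=0)
svd_proj = TruncatedSVD(
    n_components=2, algorithm="arpack", random_state=0
).fit_transform(X_centered)

# The two projections agree up to the sign of each component.
same_up_to_sign = np.allclose(np.abs(pca_proj), np.abs(svd_proj), atol=1e-6)
print(same_up_to_sign)
```

So the two coincide only on centered data. On a raw TF-IDF matrix, centering would destroy sparsity, which is one reason LSA is usually done with a truncated SVD instead of PCA.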

@juliawabant

@jbesomi For LSA or LDA I think scikit-learn is a good option (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html, https://scikit-learn.org/0.16/modules/generated/sklearn.lda.LDA.html), because you already use it for vectorization, dimensionality reduction and clustering.
I see it as a starting point, before integrating more advanced methods such as correlated topic models or structural topic models (with or without covariates; to my knowledge, the latter has an open-source implementation only in R).
For rendering the topics, people who use the scikit-learn functions typically define helpers like print_topics (see https://github.com/amueller/mglearn/blob/master/mglearn/tools.py), but we could imagine something else.
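
A rough sketch of what the scikit-learn route for LDA could look like, with a small helper in the spirit of print_topics. It uses `sklearn.decomposition.LatentDirichletAllocation`, which is scikit-learn's latent Dirichlet allocation estimator (the `sklearn.lda.LDA` class in the 0.16 docs is linear discriminant analysis, a different method). The helper name `top_terms_per_topic` and the toy documents are hypothetical, not taken from mglearn or Texthero.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
]

# LDA models word counts, so use raw counts rather than TF-IDF.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is a document's distribution over the topics.
doc_topic = lda.fit_transform(counts)


def top_terms_per_topic(model, vectorizer, n_top_terms=3):
    """Hypothetical helper: highest-weighted terms of each topic."""
    # Column index -> term, ordered by column index.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [
        [terms[i] for i in component.argsort()[::-1][:n_top_terms]]
        for component in model.components_
    ]


for topic_id, words in enumerate(top_terms_per_topic(lda, vectorizer)):
    print(f"topic {topic_id}: {', '.join(words)}")
```

The same helper works on a fitted `TruncatedSVD` as well, since both estimators expose `components_`, so one rendering function could serve both LSA and LDA.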

@jbesomi
Owner Author

jbesomi commented Aug 20, 2020

Thank you Julia! Soon, @henrifroese and @mk2510 will work on this.
And I agree, it's good to start with LSA and LDA, see how it goes, and eventually introduce more advanced methods.

@mk2510 mk2510 linked a pull request Aug 25, 2020 that will close this issue