
Topic Modelling and Visualization #163

Open
wants to merge 50 commits into base: master

Conversation

mk2510
Collaborator

@mk2510 mk2510 commented Aug 25, 2020

This PR implements support for Topic Modelling in Texthero (see #42). It may help to look at the showcase notebook before reading this.

Overview

We implement six new functions:

  • lda (Latent Dirichlet Allocation (LDA))
  • truncatedSVD (truncated Singular Value Decomposition), same as Latent Semantic Analysis / Indexing (LSA / LSI)
  • visualize_topics to visualize topics with pyLDAvis
  • topics_from_topic_model to get topics for documents after using lda/tSVD
  • top_words_per_document to get the most relevant words ("keywords") for every document
  • top_words_per_topic to get the most relevant words for every topic (=cluster)

There are now two main ways for users to find, visualize, and understand the topics in their datasets:

  1. tfidf/count/term_frequency [optional: -> flair embeddings] [optional: -> dimensionality reduction, tSVD] -> clustering. The clusters are now understood as "topics". Users can now use e.g. visualize_topics(s_tfidf, s_clustered) to see their clusters/topics visualized, and they can do top_words_per_topic(s_tfidf, s_clustered) to get the most relevant words for every cluster.

2. tfidf/count/term_frequency -> lda. Users can now use e.g. visualize_topics(s_tfidf, s_lda) to see the topics found by lda visualized, and they can do s_topics = topics_from_topic_model(s_lda) to get the best-matching topic for every document and then do top_words_per_topic(s_tfidf, s_topics) to get the most relevant words for every topic.
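The two pipelines can be sketched with the scikit-learn building blocks that Texthero wraps; the corpus and all names below are purely illustrative, not the exact Texthero API:

```python
# Hedged sketch of the two pipelines using plain scikit-learn;
# the toy corpus and variable names are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]

# Pipeline 1: tfidf -> dimensionality reduction -> clustering (clusters = topics)
tfidf = TfidfVectorizer().fit_transform(docs)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# Pipeline 2: count -> lda; rows of doc_topic relate documents to topics
counts = CountVectorizer().fit_transform(docs)
doc_topic = LatentDirichletAllocation(
    n_components=2, random_state=0
).fit_transform(counts)
```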

The new functions in detail (excerpts of their docstrings + some explanations)

LDA

lda(s: Union[VectorSeries, DocumentTermDF], n_components=10, max_iter=10, random_state=None, n_jobs=-1) -> VectorSeries

This is a very straightforward implementation of sklearn's LDA. LDA returns a matrix with dimensions (number of documents) x (number of topics) ("document-topic-matrix") that relates documents to topics (document_topic_matrix[i][j] says how strongly document i belongs to topic j (unnormalized!)).
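A toy illustration of the document-topic matrix via sklearn's LatentDirichletAllocation (the corpus is made up; note that sklearn's own transform returns row-normalized topic distributions, whereas the wrapper described above keeps unnormalized values):

```python
# Illustrative only: sklearn's LDA document-topic matrix on a made-up corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["cats purr and meow", "dogs bark loudly",
         "meow said the cat", "the dog barked again"]
counts = CountVectorizer().fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, max_iter=10, random_state=0)
document_topic_matrix = lda.fit_transform(counts)  # shape: (n_documents, n_topics)
# document_topic_matrix[i][j]: how strongly document i belongs to topic j
```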

truncatedSVD

This wraps sklearn's TruncatedSVD and is used much like PCA; see the sklearn documentation for an example of the underlying implementation.

visualize_topics

visualize_topics(s_document_term: DocumentTermDF, s_document_topic: Union[VectorSeries, CategorySeries (issue 164)], show_in_new_window=False, return_figure=False)

This is our coolest new function; it visualizes the topics interactively. It builds upon pyLDAvis, extended so that we are not restricted to LDA when profiting from its great visualization interface.

The first input is the output of tfidf/term_frequency/count. This gives us a relation (/matrix) document->terms. The second input has to give us a relation document->topic. This can either be the output of one of our clustering functions (then the clusters are the topics, so we have one topic per document; we create a document-topic-matrix from that) or of lda (then as described above in lda, we have a document-topic-matrix right there already).

From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics, and a distribution of topics to terms (similarly to what pyLDAvis does internally, but we extend it to clustering input and not only LDA). These distributions are then passed to pyLDAvis, which visualizes them. The function visualize_topics and its helper functions are really well documented 🥈, so it should be clear what's happening in the code after reading this.
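The distribution-building step for clustering input can be sketched in a few lines of numpy (shapes and values below are assumed toy data, not the actual implementation):

```python
# Minimal numpy sketch: turning a clustering into the two distributions
# pyLDAvis needs (documents->topics and topics->terms). Toy data only.
import numpy as np

doc_term = np.array([[2, 0, 1],      # term counts per document
                     [0, 3, 1],
                     [1, 1, 0]])
clusters = np.array([0, 1, 0])       # one topic per document (from clustering)

n_topics = clusters.max() + 1
doc_topic = np.eye(n_topics)[clusters]   # one-hot document-topic matrix

topic_term = doc_topic.T @ doc_term      # aggregate term counts per topic
topic_term_dist = topic_term / topic_term.sum(axis=1, keepdims=True)
```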

topics_from_topic_model

topics_from_topic_model(s_document_topic: VectorSeries) -> CategorySeries (issue 164)

Find the topics from a topic model. Input has to be output of one of lda, truncatedSVD, so the output of one of Texthero's Topic Modelling functions that returns a relation between documents and topics (the document_topic_matrix). The function uses the given relation of documents to topics to calculate the best-matching topic per document and returns a Series with the topic IDs.

The document_topic_matrix relates documents to topics: it shows for each document (each row) how strongly that document belongs to each topic, so document_topic_matrix[X][Y] = how strongly document X belongs to topic Y (as explained above). We use np.argmax to find, for each document (each row), the index of the topic it belongs to most strongly. E.g. when the first row of the document_topic_matrix is [0.2, 0.1, 0.2, 0.5], the first document is put into topic/cluster 3, as entry 3 (counting from 0) is the best-matching topic.
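The argmax step from the example above, sketched directly:

```python
# The row [0.2, 0.1, 0.2, 0.5] maps to topic 3 (counting from 0).
import numpy as np

document_topic_matrix = np.array([[0.2, 0.1, 0.2, 0.5],
                                  [0.7, 0.1, 0.1, 0.1]])
best_topic = np.argmax(document_topic_matrix, axis=1)
# best_topic -> array([3, 0])
```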

We return a CategorySeries (see #164), i.e. a series with an ID per document describing which cluster it belongs to.

top_words_per_topic

top_words_per_topic(s_document_term: DocumentTermDF, s_clusters: CategorySeries, n_words=5) -> TokenSeries

The function takes as first input a DocumentTermDF (so output of tfidf, term_frequency, count) and as second input a CategorySeries (see #164) that assigns a topic/cluster to every document (so output of a clustering function or topics_from_topic_model).

The function uses the given clustering from the second input, which relates documents to topics. The first input relates documents to terms. From those two relations (documents->topics, documents->terms), the function calculates a distribution of documents to topics, and a distribution of topics to terms. These distributions are used to find the most relevant terms per topic through pyLDAvis again (see their original paper on how they find relevant terms).
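A simplified stand-in for this step: pick the terms with the largest aggregated weight per topic. (This is toy code; the actual function delegates to pyLDAvis's more refined relevance metric from the paper mentioned above.)

```python
# Toy sketch: top terms per topic by aggregated weight, NOT pyLDAvis relevance.
import numpy as np

vocab = np.array(["cat", "dog", "market", "stock"])
doc_term = np.array([[3, 1, 0, 0],
                     [0, 4, 0, 0],
                     [0, 0, 3, 2],
                     [0, 1, 3, 2]])
clusters = np.array([0, 0, 1, 1])    # topic/cluster per document

n_topics = clusters.max() + 1
topic_term = np.eye(n_topics)[clusters].T @ doc_term
top = {t: list(vocab[np.argsort(topic_term[t])[::-1][:2]])
       for t in range(n_topics)}
# top -> {0: ['dog', 'cat'], 1: ['market', 'stock']}
```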

top_words_per_document

top_words_per_document(s_document_term: DocumentTermDF, n_words=5) -> TokenSeries

Very similar to top_words_per_topic, except that every document is treated as its own topic/cluster, so pyLDAvis finds relevant words ("keywords") that are characteristic of each document.
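The per-document variant, in the same simplified toy form (again, the real function uses pyLDAvis relevance rather than raw weights):

```python
# Toy sketch: each row's largest entries give that document's keywords.
import numpy as np

vocab = np.array(["cat", "dog", "market", "stock"])
doc_term = np.array([[3, 1, 0, 0],
                     [0, 1, 4, 2]])
keywords = [list(vocab[np.argsort(row)[::-1][:2]]) for row in doc_term]
# keywords -> [['cat', 'dog'], ['market', 'stock']]
```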

Showcase / Example

See this notebook for examples for this PR

mk2510 and others added 30 commits August 18, 2020 22:06
support MultiIndex as function parameter

returns MultiIndex, where Representation was returned

* missing: correct test


Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: test adopting for new types


Co-authored-by: Henri Froese <hf2000510@gmail.com>
Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
*missing: some test


Co-authored-by: Henri Froese <hf2000510@gmail.com>
missing tests
…pic_model

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
@mk2510
Collaborator Author

mk2510 commented Sep 5, 2020

@jbesomi we created a short notebook, where we display the functionality of those two pipelines. 🐰 We think those two will be the main use cases of the implemented functions. The third 🥉 use case, finding relevant words per topic and including them in a data frame, is just a version of the second pipeline, but with the documents clustered by a clustering algorithm like kmeans or assigned to a topic with LSA/LDA.

When those functions are ready to merge, we will prepare an exhaustive tutorial to introduce users to Topic Modelling 💯

@jbesomi jbesomi marked this pull request as draft September 8, 2020 11:23
@jbesomi
Owner

jbesomi commented Sep 8, 2020

For now, reviewed only lda, see comments below

@henrifroese
Collaborator

Thanks for the review! As I commented above, we'll have to go through this again anyway once #156 is merged 🙏 .

@jbesomi
Owner

jbesomi commented Sep 14, 2020

#156 has been merged; can you please go through it again? => let's wait for #157 to be merged

PCoA is implemented in a sub-optimal way in the pyLDAvis library. We change this (by adding 1 character to their code).

Co-authored-by: Maximilian Krahn <maximilian.krahn@icloud.com>
@mk2510
Collaborator Author

mk2510 commented Sep 22, 2020

We have also updated this branch, so it is now based on the current master 🥳 It is now ready to be reviewed or merged 🦀 🤞

@mk2510 mk2510 marked this pull request as ready for review September 22, 2020 10:09
@kepler

kepler commented Apr 1, 2021

This would be a very useful feature. Any pending blockers or any expected date for merging and releasing?

@jbesomi
Owner

jbesomi commented Apr 4, 2021

Hey @kepler
Yes, the plan is to merge this PR soon. But first, the idea is to release a new version with the HeroSeries (to introduce and explain the concept). After that, we will be able to merge this one.

The remaining steps for the HeroSeries are:

1. make sure each function makes correct use of the HeroSeries and test it (TODO, we need to open an issue)
  2. finalize the documentation for the HeroSeries (Update README.md #117, Update getting_started.md #118, Getting started: Kind of Series (HeroSeries) #135)

@jbesomi jbesomi mentioned this pull request Apr 16, 2021
@bcornet1

Hello, do you have any news on this topic or when it will be released? Thanks :)

@havardl

havardl commented Mar 17, 2022

Hi, is there any news on when this PR will be implemented?

Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Implement/support/explain topic modelling
6 participants