[Analyzer] Unsupervised Clustering #130

Open
shahrukhx01 opened this issue Jun 7, 2021 · 5 comments
Labels: analyzer, enhancement (New feature or request)

Comments

@shahrukhx01
Collaborator

shahrukhx01 commented Jun 7, 2021

@lalitpagaria to get document vectors, we can use sentence-transformers:

https://github.com/UKPLab/sentence-transformers

@shahrukhx01
Collaborator Author

@lalitpagaria the following steps are involved:

  1. Take n text documents and extract sentence/document embeddings using sentence-transformers.
  2. Apply an unsupervised clustering algorithm from scikit-learn: https://scikit-learn.org/stable/modules/clustering.html
  3. Show the raw texts grouped by cluster.
  4. Alternatively, apply dimensionality reduction and show a visualization, linking each point to its raw text (for example, shown on hover).

Hope this helps.
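
The steps above could be sketched roughly as follows. This is only a minimal illustration, not Obsei code: the function names, the `all-MiniLM-L6-v2` model, the choice of KMeans among the scikit-learn clustering algorithms, and PCA as the dimensionality reduction are all assumptions made for the example.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Step 1: sentence embeddings (the model name is an assumption)."""
    # Imported lazily so the clustering helpers work without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(texts))


def cluster_and_group(texts, embeddings, n_clusters=3, seed=0):
    """Steps 2-3: cluster the embeddings, then group raw texts by cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(embeddings)
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[int(label)].append(text)
    return labels, dict(groups)


def project_2d(embeddings, seed=0):
    """Step 4: reduce embeddings to 2-D coordinates for a scatter plot."""
    return PCA(n_components=2, random_state=seed).fit_transform(embeddings)
```

With real data, `embed_texts(texts)` would feed `cluster_and_group` and `project_2d`; the 2-D coordinates can then be handed to any plotting library that supports hover labels.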

@shahrukhx01 shahrukhx01 changed the title Unsupervised Clustering [Analyzer] Unsupervised Clustering Jul 5, 2021
@lalitpagaria
Collaborator

@shahrukhx01 Thanks for the information. Let me read through them.
For the first version, would it be possible to build clusters from a list of texts?
For example, if Obsei fetches 200 reviews, can we generate clusters from those 200 texts and then tag each review with the cluster it belongs to?
Also, is it possible to get multiple categories?
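
To make the tagging idea concrete, here is a small hedged sketch. It uses TF-IDF vectors as a stand-in for sentence embeddings so that it runs without downloading a model; `tag_reviews`, the sample reviews, and the cluster count are hypothetical and not part of Obsei.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def tag_reviews(reviews, n_clusters=2, seed=0):
    """Attach a cluster id to every review (hypothetical helper)."""
    # TF-IDF is a stand-in here; sentence embeddings would replace it.
    vectors = TfidfVectorizer().fit_transform(reviews)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(vectors)
    return [
        {"text": review, "cluster": int(label)}
        for review, label in zip(reviews, labels)
    ]


# Made-up sample data, only for illustration.
reviews = [
    "battery drains fast",
    "battery life is poor",
    "delivery was late",
    "late delivery again",
]
tagged = tag_reviews(reviews)
for item in tagged:
    print(item["cluster"], item["text"])
```

Each review ends up with exactly one cluster id, which is why a single cluster label cannot express multiple categories per review.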

@shahrukhx01
Collaborator Author

@lalitpagaria that's where topic modelling comes into play: it assigns categories based on the content of the documents. We have a separate issue for that, #131.

@lalitpagaria
Collaborator

Yeah, my bad. Then let's integrate topic modelling first.

@shahrukhx01
Collaborator Author

@lalitpagaria could you create a dataset of 200 posts as a CSV and host it on Kaggle? I'll take it up in the first week of August if no one else takes up these two issues.


2 participants