Noise topic #346

Open
ronirg opened this issue Aug 17, 2023 · 1 comment

ronirg commented Aug 17, 2023

Hi,

According to the paper:
"HDBSCAN assigns a label to each dense cluster of document vectors and assigns a noise label to all document vectors that are not in a dense cluster."

If a document is assigned the noise label, does it end up in Topic -1 or Topic 0? I cannot find this in the documentation, and I don't get a Topic -1 in my experiments.

Thanks

@jacob-bayer

I had this question too. I think that topic 0 is noise, but I'm not entirely sure. Maybe @ddangelov could weigh in. I've found that if you look closely, there are lots of other clusters that could be categorized as "noise" as well, based on their top words. In my pipeline, I check each document against the top 5 words in its topic's topic_words: if it contains fewer than 2 of those 5 words and its confidence is below 0.4, I call it an outlier. Then I look at the proportion of outliers in each cluster, and if a cluster is mostly outliers, I call it a noise cluster. That works for my data. It might not work for yours.
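
For anyone who wants to try the heuristic described above, here is a minimal sketch in plain Python. It is not the commenter's actual code and does not call the library. The function names (`flag_outliers`, `noise_topics`), the input layout (tokenized documents, one topic id and one score per document, a dict of topic words), and the 0.5 cutoff used for "mostly outliers" are all assumptions made for illustration. "Confidence" is read here as the document's score for its assigned topic, and a word counts as present only on an exact token match.

```python
from collections import defaultdict


def flag_outliers(doc_tokens, doc_topics, doc_scores, topic_words,
                  top_n=5, min_hits=2, min_score=0.4):
    """Return one bool per document: True if it looks like an outlier.

    A document is flagged when it contains fewer than `min_hits` of its
    topic's first `top_n` words AND its score is below `min_score`.
    """
    flags = []
    for tokens, topic, score in zip(doc_tokens, doc_topics, doc_scores):
        top_words = set(topic_words[topic][:top_n])
        hits = len(top_words & set(tokens))
        flags.append(hits < min_hits and score < min_score)
    return flags


def noise_topics(doc_topics, flags, min_outlier_share=0.5):
    """Return topic ids whose share of outlier documents exceeds the cutoff.

    The 0.5 default is an assumed reading of "mostly outliers".
    """
    totals = defaultdict(int)
    outliers = defaultdict(int)
    for topic, is_outlier in zip(doc_topics, flags):
        totals[topic] += 1
        outliers[topic] += int(is_outlier)
    return sorted(t for t in totals
                  if outliers[t] / totals[t] > min_outlier_share)


# Toy data, invented purely for illustration.
topic_words = {
    0: ["cat", "dog", "pet", "vet", "fur"],
    1: ["the", "and", "of", "to", "in"],
}
doc_tokens = [
    ["my", "cat", "and", "dog"],
    ["hello", "world"],
    ["random", "text"],
    ["the", "cat", "pet"],
]
doc_topics = [0, 1, 1, 0]
doc_scores = [0.8, 0.2, 0.3, 0.35]

flags = flag_outliers(doc_tokens, doc_topics, doc_scores, topic_words)
print(flags)                            # [False, True, True, False]
print(noise_topics(doc_topics, flags))  # [1]
```

In the toy data, the last document has a low score but still contains two of its topic's top words, so it is not flagged; both conditions have to hold. The thresholds are parameters because, as noted above, values that work for one dataset may not work for another.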
