build_corpus always removes words having a frequency below 5 #67

Open
SS159 opened this issue Nov 27, 2023 · 4 comments
Labels
bug Something isn't working

Comments


SS159 commented Nov 27, 2023

In this example, we are looking for mentions of countries, regions or locations on the basis of the Abstract, Author Keywords and Index Keywords. For this, we are using:

corpus = litstudy.build_corpus(docs_springer, ngram_threshold=0.8)

The ngram threshold, even at its lowest possible value (0.1), returns a list of common words found in the abstracts of these papers. However, no word in the list has fewer than 5 mentions, meaning that references to a number of countries are excluded from the word distribution.

[Image attached in the original issue]

Is there a way to reduce the ngram threshold further, or some other method, so that we can capture all word mentions, that is, a count of 1 or greater? From this we can then see which refer to geographical areas, and use the filter(like='_', axis=0) function to select relevant bigrams (e.g. United States).

Thanks,

S
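
For readers following along, the filter(like='_', axis=0) step mentioned above can be sketched with pandas. The table below is made-up data, and the assumption that bigrams appear as underscore-joined index labels (e.g. united_states) is taken from the question rather than checked against litstudy:

```python
import pandas as pd

# Hypothetical word counts; in practice these would come from the corpus.
counts = pd.DataFrame(
    {"count": [12, 3, 1, 7, 1]},
    index=["model", "united_states", "south_africa", "climate", "kenya"],
)

# axis=0 applies the filter to the row labels; like="_" keeps labels
# that contain an underscore, i.e. the bigrams under the assumption above.
bigrams = counts.filter(like="_", axis=0)
print(bigrams)
```

Single-word place names such as "kenya" are not matched by this filter and would need to be picked out separately.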

Author

SS159 commented Nov 27, 2023

Maybe the simplest solution would be to tokenise the DocumentSet and write every word it mentions to a .csv, which we could then refine for mentions of geographical locations?
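
A minimal sketch of that idea using only the Python standard library. The abstracts list is hypothetical stand-in data, tokenise and write_counts are illustrative helpers rather than litstudy functions, and how the text is extracted from the DocumentSet is left out:

```python
import csv
import io
import re
from collections import Counter

# Hypothetical abstracts standing in for the text of each document.
abstracts = [
    "Flood risk in Kenya and Uganda.",
    "Urban heat in Kenya.",
    "Drought impacts across Chile.",
]


def tokenise(text):
    # Lowercase the text and keep runs of ASCII letters only.
    return re.findall(r"[a-z]+", text.lower())


# Count every occurrence of every word, with no minimum frequency.
word_counts = Counter(token for text in abstracts for token in tokenise(text))


def write_counts(counts, handle):
    # Write one "word,count" row per word, most frequent first.
    writer = csv.writer(handle)
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])


# An in-memory buffer is used here; passing a handle from
# open("words.csv", "w", newline="") would write a file instead.
buffer = io.StringIO()
write_counts(word_counts, buffer)
print(buffer.getvalue())
```

Words that occur only once, such as "chile" here, are kept, which is the point of bypassing any frequency threshold.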

Member

stijnh commented Nov 27, 2023

You can use the min_docs=x option to specify that a word is only valid if it appears in at least x documents. By default min_docs=5.

You can change it by using:

build_corpus(...., min_docs=1)

This means a word is valid if it appears in at least one document (which is always the case).
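
To illustrate the semantics described above (a word is kept only if it appears in at least min_docs documents), here is a small standalone sketch. It is not litstudy's implementation; the vocabulary helper and the sample documents are hypothetical:

```python
from collections import Counter

# Hypothetical tokenised documents.
docs = [
    ["flood", "risk", "kenya"],
    ["flood", "risk", "chile"],
    ["flood", "kenya", "kenya"],
]


def vocabulary(docs, min_docs):
    # Count the number of documents each word appears in. set(doc) makes
    # repeated mentions inside one document count only once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word for word, n in doc_freq.items() if n >= min_docs}


# "chile" appears in one document only, so it is dropped at min_docs=2.
print(sorted(vocabulary(docs, min_docs=2)))
# With min_docs=1 every word that occurs anywhere is kept.
print(sorted(vocabulary(docs, min_docs=1)))
```

Note that the threshold counts documents, not total mentions: "kenya" occurs three times overall but in two documents.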

Author

SS159 commented Dec 4, 2023 via email

stijnh added the "bug" (Something isn't working) label Dec 7, 2023
stijnh changed the title from "Efficacy of Corpus Word Distribution" to "build_corpus always removes words having a frequency below 5" Dec 7, 2023
Member

stijnh commented Dec 7, 2023

This looks like a bug. I'll need to look into this. Thanks for reporting this!

2 participants