
Filtering outliers from Corpus with strange behavior #43

Open
ettoreaquino opened this issue May 11, 2023 · 4 comments

@ettoreaquino
Contributor

ettoreaquino commented May 11, 2023

Description

While building a Corpus using the litstudy.build_corpus() method, I have found that min_docs and max_docs_ratio are not working as expected.

For example, when forcing outliers to be kept in the Corpus by setting min_docs=1 and max_docs_ratio=1, the outliers are still removed. The following example shows a situation in which no filter should be applied (apart from stemming and stopword removal):

Corpus = litstudy.build_corpus(docs=curtailment_docs,
                               remove_words=None,
                               min_word_length=None,
                               min_docs=1,
                               max_docs_ratio=1,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=None)

Expected behavior

After performing a simple pre-filter on my database, prior to building the Corpus:

curtailment_docs = docs.filter_docs(lambda d: d.abstract is not None)
curtailment_docs = curtailment_docs.filter_docs(lambda d: 'curtailment' in d.abstract.lower())

I was expecting 'curtailment' to survive as a "forced outlier":

'curtailment' in [token[1] for token in list(Corpus.dictionary.items())]

But it gives me:

False
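
For comparison, here is a small stand-alone sketch of the document-frequency filtering that min_docs and max_docs_ratio describe. This is not litstudy's actual code, and filter_by_doc_freq is a hypothetical helper; it only illustrates why, with min_docs=1 and max_docs_ratio=1, no token should ever be dropped:

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, min_docs=1, max_docs_ratio=1.0):
    """Keep a token only if it occurs in at least `min_docs` documents
    and in at most `max_docs_ratio` fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    keep = {tok for tok, df in doc_freq.items()
            if min_docs <= df <= max_docs_ratio * n_docs}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]

docs = [["curtailment", "power"], ["curtailment", "grid"], ["curtailment"]]
# With the permissive settings from the report, every token survives,
# so 'curtailment' must still be in the vocabulary:
filtered = filter_by_doc_freq(docs, min_docs=1, max_docs_ratio=1.0)
```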

Observations

Please keep in mind that this is not easy to test. You need a very specific word that is not a stopword and appears frequently in a reasonable number of papers. In my case, I have been reviewing papers about "Curtailment in Power Systems", so I managed to collect about 1000 papers containing the word 'curtailment' in the abstract; that collection is the curtailment_docs I'm working with.

@stijnh
Member

stijnh commented May 11, 2023

Thanks for using litstudy!

Interesting problem; I'm not sure what is causing it. I'll look into this. The lack of proper tests for build_corpus and Corpus does not help, unfortunately :-(. Now might be the time to invest in those.

Looking at the code, do you have any feeling for what the problem could be? The only thing that looks suspicious to me is the call to filter_extremes.

@ettoreaquino
Contributor Author

Indeed. It seems that dic.filter_extremes(keep_n=max_tokens) provides functionality similar to preprocess_outliers(), so even if the preprocess_outliers() filter behaves as expected (which I believe it does), the subsequent filter_extremes() call overrides the desired behavior.

I think it would be better to keep only filter_extremes() and incorporate min_docs, max_docs_ratio, and max_tokens into that single call. I've checked the documentation and it should work:

Documentation: gensim.corpora.Dictionary.filter_extremes

@ettoreaquino
Contributor Author

@stijnh, can you assign this issue to me? I'll look into it and try to improve the tests for build_corpus.

@stijnh stijnh added the bug Something isn't working label May 11, 2023
@stijnh
Member

stijnh commented May 11, 2023

Thanks for looking into this. I was not aware that filter_extremes would also filter tokens based on the number of documents.
