
Filtering outliers from Corpus with strange behavior #43

Open
ettoreaquino opened this issue May 11, 2023 · 4 comments

@ettoreaquino
Contributor

ettoreaquino commented May 11, 2023

Description

While building a Corpus using the litstudy.build_corpus() method, I have found that min_docs and max_docs_ratio are not working as expected.

For example, when forcing outliers to be kept in the Corpus by setting min_docs=1 and max_docs_ratio=1, the outliers are still removed. The following example shows a situation in which no filter should be applied (apart from stemming and stopword removal):

Corpus = litstudy.build_corpus(docs=curtailment_docs,
                               remove_words=None,
                               min_word_length=None,
                               min_docs=1,
                               max_docs_ratio=1,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=None)

Expected behavior

After performing a simple pre-filter on my database, prior to building the Corpus:

curtailment_docs = docs.filter_docs(lambda d: d.abstract is not None)
curtailment_docs = curtailment_docs.filter_docs(lambda d: 'curtailment' in d.abstract.lower())

I was expecting 'curtailment' to survive as a "forced outlier":

'curtailment' in [token[1] for token in list(Corpus.dictionary.items())]

But it gives me:

False
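
For comparison, here is a small stand-alone sketch of the document-frequency filtering that min_docs and max_docs_ratio describe. This is not litstudy's actual code, and filter_by_doc_freq is a hypothetical helper; it only illustrates why, with min_docs=1 and max_docs_ratio=1, no token should ever be dropped:

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, min_docs=1, max_docs_ratio=1.0):
    """Keep a token only if it occurs in at least `min_docs` documents
    and in at most `max_docs_ratio` fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    keep = {tok for tok, df in doc_freq.items()
            if min_docs <= df <= max_docs_ratio * n_docs}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]

docs = [["curtailment", "power"], ["curtailment", "grid"], ["curtailment"]]
# With the permissive settings from the report, every token survives,
# so 'curtailment' must still be in the vocabulary:
filtered = filter_by_doc_freq(docs, min_docs=1, max_docs_ratio=1.0)
```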

Observations

Please keep in mind that this is not easy to test. You need a very specific word that is not a stopword and appears frequently in a reasonable number of papers. In my case, I have been reviewing papers about "Curtailment in Power Systems", so I managed to collect about 1000 papers containing the word 'curtailment' in the abstract; that collection is the curtailment_docs I'm working with.

@stijnh
Member

stijnh commented May 11, 2023

Thanks for using litstudy!

Interesting problem; I'm not sure what is causing it. I'll look into this. The lack of proper tests for build_corpus and Corpus does not help, unfortunately :-(. Now might be the time to invest in those.

Looking at the code, do you have any feeling for what the problem could be? The only thing that looks suspicious to me is the call to filter_extremes.

@ettoreaquino
Contributor Author

Indeed. It seems that dic.filter_extremes(keep_n=max_tokens) provides functionality similar to preprocess_outliers(), so even if the preprocess_outliers() filter behaves as expected (which I believe it does), the subsequent filter_extremes() call overrides the desired behavior.

I think it would be better to keep only filter_extremes() and incorporate min_docs, max_docs_ratio, and max_tokens into that single call. I've checked the documentation and it should work:

Documentation: gensim.corpora.Dictionary.filter_extremes

@ettoreaquino
Contributor Author

@stijnh, can you assign this issue to me? I'll look into it and try to improve the tests for build_corpus.

@stijnh stijnh added the bug Something isn't working label May 11, 2023
@stijnh
Member

stijnh commented May 11, 2023

Thanks for looking into this. I was not aware that filter_extremes would also filter tokens based on the number of documents.
