
sumbasic: KeyError #176

Open
mrx23dot opened this issue Jul 8, 2022 · 5 comments


mrx23dot commented Jul 8, 2022

sumbasic failed on the text in the attached file:
common.txt

Traceback (most recent call last):
  File "summerisers.py", line 39, in <module>
    summary = " ".join([obj._text for obj in s(parser.document, sentenceCntOut)])
  File "C:\py38_64\lib\site-packages\sumy\summarizers\sum_basic.py", line 27, in __call__
    ratings = self._compute_ratings(sentences)
  File "C:\py38_64\lib\site-packages\sumy\summarizers\sum_basic.py", line 110, in _compute_ratings
    best_sentence_index = self._find_index_of_best_sentence(word_freq, sentences_as_words)
  File "C:\py38_64\lib\site-packages\sumy\summarizers\sum_basic.py", line 92, in _find_index_of_best_sentence
    word_freq_avg = self._compute_average_probability_of_words(word_freq, words)
  File "C:\py38_64\lib\site-packages\sumy\summarizers\sum_basic.py", line 75, in _compute_average_probability_of_words
    word_freq_sum = sum([word_freq_in_doc[w] for w in content_words_in_sentence])
  File "C:\py38_64\lib\site-packages\sumy\summarizers\sum_basic.py", line 75, in <listcomp>
    word_freq_sum = sum([word_freq_in_doc[w] for w in content_words_in_sentence])
KeyError: 'look'

sumy==0.10.0

@miso-belica miso-belica self-assigned this Jul 11, 2022

slvcsl commented Dec 12, 2022

Hi! Any news on this? Thanks a lot for your work!

mrx23dot (Author) commented

Maybe this could help: word_freq_in_doc.get(w, 0)
I guess it encounters a word that is not in the dict.
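A minimal sketch of that defensive fix, using the names from the traceback (the function body here is a simplified stand-in, not sumy's actual implementation):

```python
def compute_average_probability_of_words(word_freq_in_doc, content_words_in_sentence):
    """Average document frequency of a sentence's content words.

    Words missing from the document-frequency dict contribute 0 instead
    of raising KeyError, per the .get(w, 0) suggestion above.
    """
    if not content_words_in_sentence:
        return 0.0
    word_freq_sum = sum(word_freq_in_doc.get(w, 0) for w in content_words_in_sentence)
    return word_freq_sum / len(content_words_in_sentence)
```

This silences the crash, but it only masks the underlying vocabulary mismatch discussed below in this thread: words that should have a frequency are silently treated as unseen.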


slvcsl commented Dec 12, 2022

My understanding is that it happens because _get_content_words_in_sentence and _get_all_content_words_in_doc use different preprocessing pipelines.

I modified _get_all_content_words_in_doc to use the same preprocessing as _get_content_words_in_sentence:

def _get_all_content_words_in_doc(self, sentences):
    # Same pipeline as _get_content_words_in_sentence:
    # normalize -> filter stop words -> stem
    normalized_words = []
    for s in sentences:
        normalized_words += self._normalize_words(s.words)
    normalized_content_words = self._filter_out_stop_words(normalized_words)
    stemmed_normalized_content_words = self._stem_words(normalized_content_words)
    return stemmed_normalized_content_words

It works now, but I still had no time to double-check that this is the correct solution.
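The mismatch can be illustrated with a toy pipeline (simplified stemmer, not sumy's code): when the per-sentence path stems words but the document-frequency path does not, the stemmed sentence word is no longer a key in the frequency dict.

```python
def stem(word):
    # Toy stemmer for illustration: strip a trailing "ing".
    return word[:-3] if word.endswith("ing") else word

sentence_words = ["looking"]
doc_words = ["looking"]

# Per-sentence path: words are stemmed.
sentence_content = [stem(w) for w in sentence_words]   # ["look"]

# Document path (buggy variant): no stemming before building frequencies.
doc_freq = {w: 1.0 for w in doc_words}                 # {"looking": 1.0}

# "look" is not a key in doc_freq, so word_freq_in_doc[w] raises KeyError.
missing = [w for w in sentence_content if w not in doc_freq]
```

Any preprocessing step (normalization, stop-word filtering, stemming) applied on one side but not the other can produce such a missing key.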


tezer commented Mar 22, 2023

Same error from the docker version:

Traceback (most recent call last):
  File "/usr/local/bin/sumy", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/sumy/__main__.py", line 70, in main
    for sentence in summarizer(parser.document, items_count):
  File "/usr/local/lib/python3.10/site-packages/sumy/summarizers/sum_basic.py", line 27, in __call__
    ratings = self._compute_ratings(sentences)
  File "/usr/local/lib/python3.10/site-packages/sumy/summarizers/sum_basic.py", line 110, in _compute_ratings
    best_sentence_index = self._find_index_of_best_sentence(word_freq, sentences_as_words)
  File "/usr/local/lib/python3.10/site-packages/sumy/summarizers/sum_basic.py", line 92, in _find_index_of_best_sentence
    word_freq_avg = self._compute_average_probability_of_words(word_freq, words)
  File "/usr/local/lib/python3.10/site-packages/sumy/summarizers/sum_basic.py", line 75, in _compute_average_probability_of_words
    word_freq_sum = sum([word_freq_in_doc[w] for w in content_words_in_sentence])
  File "/usr/local/lib/python3.10/site-packages/sumy/summarizers/sum_basic.py", line 75, in <listcomp>
    word_freq_sum = sum([word_freq_in_doc[w] for w in content_words_in_sentence])
KeyError: 'own'

nefastosaturo commented

Hello there.

I encountered this error too.

The problems are in two functions in sum_basic.py, _get_content_words_in_sentence and _get_all_content_words_in_doc, but mostly here.

The different steps in those functions create two different sets of words, because the stop-word filter is called before or after normalization and stemming depending on the code path. The _get_all_words_in_doc function also calls the stemmer, which further confuses the stop-word filtering.

So I just changed them like this:

    def _get_all_words_in_doc(self, sentences):
        # return self._stem_words([w for s in sentences for w in s.words])
        return [w for s in sentences for w in s.words]

    def _get_content_words_in_sentence(self, sentence): 
        # firstly normalize
        normalized_words = self._normalize_words(sentence.words) 
        # then filter out stop words
        normalized_content_words = self._filter_out_stop_words(normalized_words)
        # then stem
        stemmed_normalized_content_words = self._stem_words(normalized_content_words)
        return stemmed_normalized_content_words

    def _get_all_content_words_in_doc(self, sentences):
        all_words = self._get_all_words_in_doc(sentences)
        normalized_words = self._normalize_words(all_words)
        normalized_content_words = self._filter_out_stop_words(normalized_words)
        stemmed_normalized_content_words = self._stem_words(normalized_content_words)
        return stemmed_normalized_content_words
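A standalone check of the invariant this patch restores: if both sides run the same normalize, stop-word filter, stem pipeline, every content word of every sentence is guaranteed to be a key in the document-frequency dict. Helper names and the toy stemmer here are illustrative, not sumy's API.

```python
STOP_WORDS = {"the", "a", "is"}

def normalize(words):
    return [w.lower() for w in words]

def filter_stop_words(words):
    return [w for w in words if w not in STOP_WORDS]

def stem(words):
    # Toy stemmer: strip a trailing "s".
    return [w[:-1] if w.endswith("s") else w for w in words]

def content_words(words):
    # The single shared pipeline: normalize -> filter -> stem.
    return stem(filter_stop_words(normalize(words)))

sentences = [["The", "cats", "look"], ["A", "cat", "sleeps"]]

# Document side uses the exact same pipeline as the sentence side.
doc_words = content_words([w for s in sentences for w in s])
freq = {w: doc_words.count(w) / len(doc_words) for w in doc_words}

# Every sentence content word is a key in freq: no KeyError possible.
for s in sentences:
    for w in content_words(s):
        assert w in freq
```

Because the two paths can never disagree on vocabulary, the defensive .get(w, 0) suggested earlier in the thread becomes unnecessary.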
