Tip: how to make it summarize mid-tail languages, e.g. Polish #204

Manamama · 2024-02-06T17:04:49Z · edited by miso-belica

Problem:

The sumy module relies on the nltk package for stemming and stop words, but nltk does not support some languages, such as Polish, out of the box.

Solution:

Stop words:

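Download a Polish stop words file (e.g. from here), rename it to polish.txt, and place it in sumy's stop words directory. On my system that is ~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt; the exact location depends on your Python version and on where sumy is installed.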
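If you would rather not hard-code the site-packages path, the copy step can be scripted. This is a minimal sketch, assuming the data/stopwords layout from the path above; install_stop_words is a hypothetical helper, not part of sumy.

```python
import shutil
from pathlib import Path


def install_stop_words(source_file, package_dir, language="polish"):
    """Copy a stop words file to <package_dir>/data/stopwords/<language>.txt."""
    # Assumption: the package keeps its stop words under data/stopwords/,
    # as in the path quoted above.
    target_dir = Path(package_dir) / "data" / "stopwords"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{language}.txt"
    shutil.copyfile(source_file, target)
    return target
```

With sumy installed, calling install_stop_words("downloaded.txt", Path(sumy.__file__).parent) after import sumy should resolve the directory for whichever Python environment is active. The handle_arguments code quoted further down also reads a --stopwords argument, so passing the file that way may avoid touching site-packages at all.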

Stemming:

Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:

from stempel import StempelStemmer
from sumy.nlp.stemmers import Stemmer  # sumy's original stemmer, used as the fallback


class CallableStemmer:
    """Adapter: sumy calls the stemmer like a function, StempelStemmer exposes .stem()."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def get_stemmer(language):
    if language == 'pol':
        # Create a StempelStemmer object for Polish
        stemmer_obj = StempelStemmer.default()
        # Wrap it in a CallableStemmer
        return CallableStemmer(stemmer_obj)
    else:
        # For non-Polish languages, use the original Stemmer
        return Stemmer(language)

Then, in the handle_arguments function in this section, replace the line that creates the stemmer with a call to get_stemmer:

def handle_arguments(args, default_input_stream=sys.stdin):
    # ... (other code) ...

    language = args["--language"]
    if args["--stopwords"]:
        stop_words = read_stop_words(args["--stopwords"])
    else:
        stop_words = get_stop_words(language)

    parser = parser(document_content, Tokenizer(language))
    stemmer = get_stemmer(language)

    # ... (other code) ...

This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
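It may be worth guarding against the wrapped stemmer returning nothing for a word it cannot handle; whether StempelStemmer.stem ever does so is an assumption here, not something verified. A minimal sketch of a more defensive wrapper under that assumption follows. SafeCallableStemmer and FakeStemmer are hypothetical names, and FakeStemmer only stands in for the real stemmer so the example runs without pystempel.

```python
class SafeCallableStemmer:
    """Callable adapter around any object exposing a stem(word) method."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        stem = self.stemmer.stem(word)
        # Fall back to the lower-cased word when no stem is found.
        return stem if stem else word.lower()


class FakeStemmer:
    """Stand-in for StempelStemmer so the sketch runs without pystempel."""

    def stem(self, word):
        # Toy rule for illustration only: strip one known suffix.
        return word[:-3] if word.endswith("ami") else None


stem = SafeCallableStemmer(FakeStemmer())
print(stem("kotami"))  # kot
print(stem("Dom"))     # dom
```

In get_stemmer, the same wrapper could be returned in place of CallableStemmer, so the summarizer always gets a string back.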

Credit for most of the code: MS Copilot aka Bing


miso-belica · 2024-02-06T19:03:21Z

Hi, thank you for the issue. If NLTK support is not good enough maybe it would be better to add the support you are suggesting into NLTK. WDYT?

Manamama · 2024-02-10T01:06:54Z

I have seen that you have stemmers in your code for Slovak, Greek etc. We had better add Polish there, instead.

(BTW, I know next to nothing about such architecture, I have just been hacking here... )
