Tip: how to make it summarize mid-tail languages, e.g. Polish #204

Open
Manamama opened this issue Feb 6, 2024 · 2 comments


@Manamama

Manamama commented Feb 6, 2024

Problem:

The sumy module uses the nltk package for stemming, and nltk does not support some mid-tail languages, such as Polish, out of the box. sumy's bundled stop word lists do not include Polish either.

Solution:

Stop words:

Download a Polish stop words file (e.g. from here), rename it to polish.txt, and place it in the sumy stop words directory (on my install: ~/.local/lib/python3.10/site-packages/sumy/data/stopwords/polish.txt).
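
The copy step can also be scripted. Below is a minimal sketch, assuming the stop words directory sits at data/stopwords inside the installed sumy package, as in the path above. The helper name install_stop_words and its parameters are made up for illustration and are not part of sumy.

```python
import shutil
from pathlib import Path


def install_stop_words(source_file, package_dir, language="polish"):
    """Copy a stop words file to <package_dir>/data/stopwords/<language>.txt.

    Hypothetical helper, not part of sumy. `package_dir` is assumed to be
    the root of the installed sumy package.
    """
    target_dir = Path(package_dir) / "data" / "stopwords"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{language}.txt"
    shutil.copyfile(source_file, target)
    return target


# Possible usage (requires sumy to be installed and write access to it):
#   import sumy
#   install_stop_words("downloaded_stopwords.txt", Path(sumy.__file__).parent)
```

If you would rather not touch site-packages, the --stopwords option that handle_arguments reads (see below) should let you point sumy at the file directly.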

Stemming:

Use the pystempel package, which provides a stemmer for the Polish language. Here’s the code:

from stempel import StempelStemmer
from sumy.nlp.stemmers import Stemmer


class CallableStemmer:
    """Adapter: sumy calls stemmer(word), while Stempel exposes .stem(word)."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


def get_stemmer(language):
    # 'pol' is the ISO 639-2/3 code for Polish, as passed via --language
    if language == 'pol':
        # Create a StempelStemmer object for Polish
        stemmer_obj = StempelStemmer.default()
        # Wrap it in a CallableStemmer
        return CallableStemmer(stemmer_obj)
    else:
        # For non-Polish languages, use the original Stemmer
        return Stemmer(language)

Then, in the handle_arguments function (in this section of the sumy source), replace the line where the stemmer is created with a call to get_stemmer:

def handle_arguments(args, default_input_stream=sys.stdin):
    # ... (other code) ...

    language = args["--language"]
    if args["--stopwords"]:
        stop_words = read_stop_words(args["--stopwords"])
    else:
        stop_words = get_stop_words(language)

    parser = parser(document_content, Tokenizer(language))
    stemmer = get_stemmer(language)

    # ... (other code) ...

This way, if the language is Polish, get_stemmer will return a CallableStemmer that wraps a StempelStemmer. For any other language, it will return the original Stemmer.
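
To show why the wrapper is needed, here is a small self-contained sketch of the same adapter pattern. SuffixStripper is a made-up toy stand-in for StempelStemmer so the snippet runs without pystempel installed. Its suffix list is purely illustrative and is not how Stempel actually stems Polish.

```python
class CallableStemmer:
    """Same adapter as above: turns an object with .stem() into a callable."""

    def __init__(self, stemmer):
        self.stemmer = stemmer

    def __call__(self, word):
        return self.stemmer.stem(word)


class SuffixStripper:
    """Hypothetical toy stemmer, used only to demonstrate the wrapper."""

    SUFFIXES = ("ami", "ach", "ów")

    def stem(self, word):
        for suffix in self.SUFFIXES:
            # Strip a suffix only if a reasonably long stem remains
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word


stem = CallableStemmer(SuffixStripper())
print(stem("domami"))  # called like a plain function, as sumy does
```

With the real package, StempelStemmer.default() would take the place of SuffixStripper(), as get_stemmer does above.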

Credit for most of the code: MS Copilot aka Bing

@miso-belica
Owner

Hi, thank you for the issue. If NLTK support is not good enough maybe it would be better to add the support you are suggesting into NLTK. WDYT?

@Manamama
Author

I have seen that you have stemmers in your code for Slovak, Greek etc. It would be better to add Polish there instead.

(BTW, I know next to nothing about such architecture; I have just been hacking here...)
