Word stemming for multi language with snowball, e.g. French, Spanish ... #1062

Open
sunxk opened this issue Apr 15, 2024 · 1 comment
Labels
feature New feature or request pg_search Issue related to `pg_search/` priority-2-medium Medium priority issue

Comments

sunxk commented Apr 15, 2024

What
ParadeDB currently supports English stemming through "en_stem".
Snowball is an off-the-shelf library that supports stemming for many languages, such as French and Spanish,
so I suggest a new feature: stemming for other languages with Snowball.

Why
Elasticsearch (ES) supports stemming for different languages through plugins.

How
It could be implemented using snowball-rust:
snowball github
snowball demo

sunxk commented Apr 15, 2024

I read the implementation of ParadeDB's [en_stem] tokenizer and found that it uses tantivy's English stemmer. Since tantivy already offers stemmers for many languages, using tantivy's stemmers might be the best choice.

https://github.com/quickwit-oss/tantivy/blob/main/src/tokenizer/stemmer.rs

pub enum Language {
    Arabic,
    Danish,
    Dutch,
    English,
    Finnish,
    French,
    German,
    Greek,
    Hungarian,
    Italian,
    Norwegian,
    Portuguese,
    Romanian,
    Russian,
    Spanish,
    Swedish,
    Tamil,
    Turkish,
}
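
To make the suggestion concrete, here is a minimal sketch of how a tokenizer name could select one of the languages listed above. It is not ParadeDB's code. The `<code>_stem` naming scheme (generalizing the existing "en_stem" with ISO 639-1 codes) and the helper `language_for_tokenizer` are hypothetical, and the enum is a local mirror of the tantivy one so the snippet compiles with the standard library alone.

```rust
use std::str::FromStr;

/// Local mirror of the `Language` enum quoted above (assumption: in real code
/// this would be tantivy's own enum rather than a copy).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Arabic, Danish, Dutch, English, Finnish, French,
    German, Greek, Hungarian, Italian, Norwegian, Portuguese,
    Romanian, Russian, Spanish, Swedish, Tamil, Turkish,
}

impl FromStr for Language {
    type Err = String;

    /// Parse an ISO 639-1 code (case-insensitive) into a stemmer language.
    fn from_str(code: &str) -> Result<Self, Self::Err> {
        match code.to_ascii_lowercase().as_str() {
            "ar" => Ok(Language::Arabic),
            "da" => Ok(Language::Danish),
            "nl" => Ok(Language::Dutch),
            "en" => Ok(Language::English),
            "fi" => Ok(Language::Finnish),
            "fr" => Ok(Language::French),
            "de" => Ok(Language::German),
            "el" => Ok(Language::Greek),
            "hu" => Ok(Language::Hungarian),
            "it" => Ok(Language::Italian),
            "no" => Ok(Language::Norwegian),
            "pt" => Ok(Language::Portuguese),
            "ro" => Ok(Language::Romanian),
            "ru" => Ok(Language::Russian),
            "es" => Ok(Language::Spanish),
            "sv" => Ok(Language::Swedish),
            "ta" => Ok(Language::Tamil),
            "tr" => Ok(Language::Turkish),
            other => Err(format!("unsupported stemmer language: {}", other)),
        }
    }
}

/// Hypothetical helper: map a tokenizer name such as "fr_stem" to a language.
/// The "_stem" suffix convention is an assumption modeled on "en_stem".
pub fn language_for_tokenizer(name: &str) -> Result<Language, String> {
    let code = name
        .strip_suffix("_stem")
        .ok_or_else(|| format!("not a stemming tokenizer: {}", name))?;
    code.parse()
}
```

With tantivy as a dependency, the parsed value would presumably be handed to `Stemmer::new(language)` when building the text analyzer, so supporting a new language would mostly be a matter of accepting its name.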

@philippemnoel philippemnoel added feature New feature or request priority-2-medium Medium priority issue pg_search Issue related to `pg_search/` labels Apr 15, 2024