s-stemmer deviates from paper? #157

Open
markharwood opened this issue Jun 17, 2019 · 4 comments
Comments

@markharwood

I see that bees doesn't stem to bee and tomatoes doesn't stem to tomato.

Is this misinterpreting the logic in the original paper?
I ask because I work on elasticsearch and discovered that we have a similar issue. See elastic/elasticsearch#42892 (comment) for my notes on the confusion.

@Yomguithereal
Owner

Hello @markharwood. That's entirely possible, because I think I wrote my implementation while reading Lucene's, which should be the same one ES is using. Do you, by chance, have a link to the original article, or a PDF of it? As stated here, I could only find a paper referencing the algorithm and explaining its broad intentions.

@markharwood
Author

No, I only saw the same paper as you. I've just tried sending an email to the original paper author - I'm sure she'd like to see her algorithm implemented correctly too.

@markharwood
Author

markharwood commented Jul 9, 2019

I heard back from Donna, the paper author. She agrees that words like bees/employees should fall into rule 3 and have the S removed. However, that logic would make rule 2 redundant.
Rule 1 also has some weird looking exceptions which don't appear to relate to any common English words that I know of.
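To make the two readings concrete, here is a minimal Python sketch. The rule table is an assumption: it is the form in which the S-stemmer rules are commonly quoted, not something stated in this thread, and the function names `stem_strict` and `stem_fallthrough` are hypothetical. The first function stops at the first rule whose suffix matches, which roughly reproduces the behavior reported above; the second lets an exception skip only that rule, which is the reading where bees falls into rule 3.

```python
# ASSUMPTION: rules as commonly quoted for the S-stemmer, not taken from this thread.
# Each entry is (suffix, exception suffixes, replacement).
RULES = [
    ("ies", ("eies", "aies"), "y"),      # rule 1
    ("es", ("aes", "ees", "oes"), "e"),  # rule 2
    ("s", ("us", "ss"), ""),             # rule 3
]


def stem_strict(word):
    """The first rule whose suffix matches decides the outcome.

    If that rule's exception list also matches, the word is left unchanged
    and no later rule is tried.
    """
    for suffix, exceptions, replacement in RULES:
        if word.endswith(suffix):
            if word.endswith(exceptions):
                return word
            return word[: -len(suffix)] + replacement
    return word


def stem_fallthrough(word):
    """An exception only skips its own rule; later rules may still apply."""
    for suffix, exceptions, replacement in RULES:
        if word.endswith(suffix) and not word.endswith(exceptions):
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["bees", "employees", "tomatoes", "ponies", "horses", "glass"]:
        print(w, stem_strict(w), stem_fallthrough(w))
```

Under the fall-through reading, rule 2 rewrites `es` to `e`, which is the same as stripping the final `s`, so rule 3 alone would give an identical result. Neither reading in this sketch turns tomatoes into tomato: the strict version leaves it unchanged and the fall-through version yields tomatoe.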

The origins of the S-stemmer algorithm appear to be lost in time - Donna didn't author it and suggested the logic may be connected to the SMART system from way back when.

Rather than trying to resolve that I've been working on an alternative plural stemmer for elasticsearch here

@Yomguithereal
Owner

Cool. Can you tell me when you feel your stemmer is done and merged into ES? I will then be able to replicate it here if you want. Or feel free to open a PR if you would like to do it yourself.


2 participants