New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Porter stemmer's output inconsistent with that of reference implementations #139
Comments
I just put the "goodfirstbug" label on this one. The "good first bug" would be adding a bunch of test cases to stem.doctest (https://github.com/nltk/nltk/blob/master/nltk/test/stem.doctest) -- if you feel like merging in Stuart's fixes too, that would be great! |
Hey all, especially @alexrudnick , should this be marked as resolved now? |
Upon a skim, looks like no, it's not yet resolved. But would like to hear from a maintainer. |
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version. This point resolves nltk#139 - As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true) - Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary. - Fixed the demo - Made code at least roughly comply with PEP 8 - Purged copyright notice wrongly attributing authorship - Moved comments about contributors into the contributors file, where they better belong - Made function names more verbose and algorithm details more simple with the aim of improving readability - Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt - Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from Minor compatability-breaking changes, with justification: - Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong - Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version. This point resolves nltk#139 - As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true) - Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary. - Fixed the demo - Made code at least roughly comply with PEP 8 - Purged copyright notice wrongly attributing authorship - Moved comments about contributors into the contributors file, where they better belong - Made function names more verbose and algorithm details more simple with the aim of improving readability - Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt - Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from Minor compatability-breaking changes, with justification: - Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong - Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version. This point resolves nltk#139 - As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true) - Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary. - Fixed the demo - Made code at least roughly comply with PEP 8 - Purged copyright notice wrongly attributing authorship - Moved comments about contributors into the contributors file, where they better belong - Made function names more verbose and algorithm details more simple with the aim of improving readability - Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt - Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from Minor compatability-breaking changes, with justification: - Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong - Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
@paulproteus – finally resolved |
Note that the default stemmer behavior as of my PR that Steven just merged is unchanged; you need to explicitly pass Arguably, having None of the options is perfect. I've opted for having |
@ExplodingCabbage: I favour your last option, no default, and doing that with the next major release (not the minor release where your work will first appear), with suitable warnings in the release notes. I think it would be good for people to be forced to read the docs and benefit from your work. I'm curious to know what others think. |
Hi guys, I've been writing a search engine backend with the default PorterStemmer in nltk, not knowing that it doesn't behave the same way as many other Porter stemmer implementations. Now that I'm working on the front-end using Javascript I'm running into bugs where the frontend and my backend stem words differently. I was wondering what I should look at if i need to recreate nltk's default PorterStemmer behavior in Javascript so I can run it in the browser. I was hoping maybe someone where (maybe @ExplodingCabbage ?) could point me to the right direction. I really don't have time to re-index everything with the MARTIN_EXTENSIONS mode as it would take weeks to do it... |
@josephcc porter.py has basically no dependencies and isn't doing anything profound or magical, just a long series of string manipulations. You'll just want to port the Not a 5-minute task or a foolproof one, admittedly, but doable. Also check out |
I recently used the NLTK's Porter stemmer and discovered some disrepancies between its output and that of another version of the Porter stemmer that I have used. Following up on these discrepancies, I discovered that there may be some problems with the NLTK's implementation.
There are various reference implementations of the Porter stemmer collected by Martin Porter himself here:
http://tartarus.org/~martin/PorterStemmer/
I tried the one for Ruby to sanity check the NLTK (it takes a word from standard out and spits out its stemmed form immediately, also on standard out):
scripts $ ruby porter_stemmer.rb
shiny
shini
I checked this against the NLTK:
scripts $ python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'shini'
So far, so good. And if I compare these results to the expected results (output.txt) provided on the same page for a sample user dict file (voc.txt), they look right:
scripts $ egrep -n '^shiny$' voc.txt
18333:shiny
scripts $ egrep -n '^shini$' output.txt
18333:shini
But if I systematically compare the NLTK results against those in the expected results file, I see a lot of discrepancies (format: word, proper expected stemming, bad NLTK stemming marked by asterisk):
scripts $ ./show_bad_stemming_in_nltk.py voc.txt output.txt
abbey abbei *abbey
abbeys abbei *abbey
abed ab *abe
absey absei *absey
[...]
wrying wry *wri
yesterday yesterdai *yesterday
yesterdays yesterdai *yesterday
yongrey yongrei *yongrey
Spot checking against the Ruby reference implemenation confirms that the NLTK results are in fact the problem:
scripts $ ruby porter_stemmer.rb
abbey
abbei
abbeys
abbei
abed
ab
absey
absei
wrying
wry
yesterday
yesterdai
yesterdays
yesterdai
yongrey
yongrei
I've attached the full list of words for which the NLTK's Porter stemmer provides unexpected results.
(This bug pertains to version 2.0b9. I suspect it exists in all previous versions, but I haven't confirmed that.)
Migrated from http://code.google.com/p/nltk/issues/detail?id=625
earlier comments
gregg.lind said, at 2011-02-09T20:22:57.000Z:
I would be happy to take this one on (by friday), if no one else wants it. I have used this module a lot.
Gregg Lind
StevenBird1 said, at 2011-02-14T07:14:49.000Z:
Attached is Stuart Robinson's fixes, sent to nltk-dev. This seems to be further edits to the previous version rather than a fresh port of the Ruby version as originally discussed. Before this new version can be incorporated we need a set of test cases added to test/stem.doctest.
The text was updated successfully, but these errors were encountered: