Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Porter stemmer's output inconsistent with that of reference implementations #139

Closed
alexrudnick opened this issue Jan 17, 2012 · 8 comments

Comments

@alexrudnick
Copy link
Member

I recently used the NLTK's Porter stemmer and discovered some disrepancies between its output and that of another version of the Porter stemmer that I have used. Following up on these discrepancies, I discovered that there may be some problems with the NLTK's implementation.

There are various reference implementations of the Porter stemmer collected by Martin Porter himself here:

http://tartarus.org/~martin/PorterStemmer/

I tried the one for Ruby to sanity check the NLTK (it takes a word from standard out and spits out its stemmed form immediately, also on standard out):

scripts $ ruby porter_stemmer.rb
shiny
shini

I checked this against the NLTK:

scripts $ python
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

        import nltk
        nltk.stem.porter.PorterStemmer().stem_word('shiny')

'shini'

So far, so good. And if I compare these results to the expected results (output.txt) provided on the same page for a sample user dict file (voc.txt), they look right:

scripts $ egrep -n '^shiny$' voc.txt
18333:shiny
scripts $ egrep -n '^shini$' output.txt
18333:shini

But if I systematically compare the NLTK results against those in the expected results file, I see a lot of discrepancies (format: word, proper expected stemming, bad NLTK stemming marked by asterisk):

scripts $ ./show_bad_stemming_in_nltk.py voc.txt output.txt
abbey abbei *abbey
abbeys abbei *abbey
abed ab *abe
absey absei *absey
[...]
wrying wry *wri
yesterday yesterdai *yesterday
yesterdays yesterdai *yesterday
yongrey yongrei *yongrey

Spot checking against the Ruby reference implemenation confirms that the NLTK results are in fact the problem:

scripts $ ruby porter_stemmer.rb
abbey
abbei
abbeys
abbei
abed
ab
absey
absei
wrying
wry
yesterday
yesterdai
yesterdays
yesterdai
yongrey
yongrei

I've attached the full list of words for which the NLTK's Porter stemmer provides unexpected results.

(This bug pertains to version 2.0b9. I suspect it exists in all previous versions, but I haven't confirmed that.)

Migrated from http://code.google.com/p/nltk/issues/detail?id=625


earlier comments

gregg.lind said, at 2011-02-09T20:22:57.000Z:

I would be happy to take this one on (by friday), if no one else wants it. I have used this module a lot.

Gregg Lind

StevenBird1 said, at 2011-02-14T07:14:49.000Z:

Attached is Stuart Robinson's fixes, sent to nltk-dev. This seems to be further edits to the previous version rather than a fresh port of the Ruby version as originally discussed. Before this new version can be incorporated we need a set of test cases added to test/stem.doctest.

@alexrudnick
Copy link
Member Author

I just put the "goodfirstbug" label on this one. The "good first bug" would be adding a bunch of test cases to stem.doctest (https://github.com/nltk/nltk/blob/master/nltk/test/stem.doctest) -- if you feel like merging in Stuart's fixes too, that would be great!

@paulproteus
Copy link

Hey all, especially @alexrudnick , should this be marked as resolved now?

@paulproteus
Copy link

Upon a skim, looks like no, it's not yet resolved. But would like to hear from a maintainer.

ExplodingCabbage added a commit to ExplodingCabbage/nltk that referenced this issue Jan 22, 2016
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version.

  This point resolves nltk#139
- As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true)
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
- Fixed the demo
- Made code at least roughly comply with PEP 8
- Purged copyright notice wrongly attributing authorship
- Moved comments about contributors into the contributors file, where they better belong
- Made function names more verbose and algorithm details more simple with the aim of improving readability
- Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
- Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from

Minor compatability-breaking changes, with justification:

- Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
- Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
ExplodingCabbage added a commit to ExplodingCabbage/nltk that referenced this issue Jan 23, 2016
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version.

  This point resolves nltk#139
- As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true)
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
- Fixed the demo
- Made code at least roughly comply with PEP 8
- Purged copyright notice wrongly attributing authorship
- Moved comments about contributors into the contributors file, where they better belong
- Made function names more verbose and algorithm details more simple with the aim of improving readability
- Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
- Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from

Minor compatability-breaking changes, with justification:

- Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
- Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
ExplodingCabbage added a commit to ExplodingCabbage/nltk that referenced this issue Sep 4, 2016
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version.

  This point resolves nltk#139
- As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true)
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
- Fixed the demo
- Made code at least roughly comply with PEP 8
- Purged copyright notice wrongly attributing authorship
- Moved comments about contributors into the contributors file, where they better belong
- Made function names more verbose and algorithm details more simple with the aim of improving readability
- Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
- Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from

Minor compatability-breaking changes, with justification:

- Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
- Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
@stevenbird
Copy link
Member

@paulproteus – finally resolved

@ExplodingCabbage
Copy link
Contributor

Note that the default stemmer behavior as of my PR that Steven just merged is unchanged; you need to explicitly pass mode=PorterStemmer.MARTIN_EXTENSIONS to the PorterStemmer constructor to get behavior that's consistent with Martin's reference implementations (which are themselves inconsistent with Martin's original algorithm).

Arguably, having MARTIN_EXTENSIONS as the default mode (for consistency with the reference implementations) would be better, since users are going to expect something called PorterStemmer to behave out of the box like Martin's reference implementations. The trouble is that that would be a backwards compatibility break, and a nastily non-obvious one; somebody using NLTK's previous implementation who upgrades NLTK could fail to notice for a long time that just a few of their stems have changed, possibly introducing subtle bugs depending upon their use case. Another option is to not have a default value at all, and require every user to read the docs on the different modes and explicitly choose which one to use. This would break backwards-compatibility too, but would do so in an obvious way (just trying to instantiate the stemmer the old way would blow up) so people doing upgrades without reading the release notes wouldn't be caught out. That approach would avoid either of the bad scenarios above, but at the cost of demanding more up-front work from every new user of the stemmer.

None of the options is perfect. I've opted for having NLTK_EXTENSIONS as the default, but there's room to disagree. Anybody have an opinion?

@stevenbird
Copy link
Member

@ExplodingCabbage: I favour your last option, no default, and doing that with the next major release (not the minor release where your work will first appear), with suitable warnings in the release notes. I think it would be good for people to be forced to read the docs and benefit from your work. I'm curious to know what others think.

@josephcc
Copy link

Hi guys,

I've been writing a search engine backend with the default PorterStemmer in nltk, not knowing that it doesn't behave the same way as many other Porter stemmer implementations. Now that I'm working on the front-end using Javascript I'm running into bugs where the frontend and my backend stem words differently. I was wondering what I should look at if i need to recreate nltk's default PorterStemmer behavior in Javascript so I can run it in the browser. I was hoping maybe someone where (maybe @ExplodingCabbage ?) could point me to the right direction.

I really don't have time to re-index everything with the MARTIN_EXTENSIONS mode as it would take weeks to do it...

@ExplodingCabbage
Copy link
Contributor

@josephcc porter.py has basically no dependencies and isn't doing anything profound or magical, just a long series of string manipulations. You'll just want to port the PorterStemmer class to JavaScript, preserving only the if self.mode == self.NLTK_EXTENSIONS branches and throwing away the logic from the others.

Not a 5-minute task or a foolproof one, admittedly, but doable. Also check out PorterTest in https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_stem.py and consider running the test cases there against your JavaScript implementation to check the correctness of your work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants