Rewrite porter.py
- Added three modes: one faithful to the original paper, one to Martin Porter's extended version, and one to NLTK's extended version.

  This point resolves nltk#139
- As a consequence, ensured all NLTK-specific departures are clearly marked (which was not previously the case)
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site. Expected output for the original version was generated by running Martin's C version of the stemmer against the testing vocabulary; expected output for the NLTK version was generated by running the NLTK stemmer, as it stood prior to this commit, against the same vocabulary.
- Fixed the demo
- Made code at least roughly comply with PEP 8
- Purged copyright notice wrongly attributing authorship
- Moved comments about contributors into the contributors file, where they better belong
- Made function names more verbose and algorithm details simpler, with the aim of improving readability
- Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
- Removed a load of commented-out code that I guess came from whatever source porter.py was originally copied into NLTK from
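
The vocabulary-based testing described above can be sketched roughly as follows. This is a minimal illustration, not the test code from this commit: `find_mismatches` and `toy_stem` are hypothetical names, the word lists are made up, and `toy_stem` is a deliberately crude stand-in rather than the Porter algorithm. In practice the two lists would be read from the vocabulary file and the matching expected-output file.

```python
def find_mismatches(stem, words, expected):
    """Return (word, got, wanted) triples where stem(word) != wanted.

    `stem` is any callable mapping a word to its stem; `words` and
    `expected` are parallel lists (input vocabulary, expected stems).
    """
    if len(words) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, wanted in zip(words, expected):
        got = stem(word)
        if got != wanted:
            mismatches.append((word, got, wanted))
    return mismatches


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    # This is NOT the Porter algorithm; it only exercises the harness.
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


# Made-up sample data standing in for the vocabulary and expected output.
vocabulary = ["cats", "dog", "running"]
expected_output = ["cat", "dog", "run"]

mismatches = find_mismatches(toy_stem, vocabulary, expected_output)
for word, got, wanted in mismatches:
    print(f"{word!r}: got {got!r}, expected {wanted!r}")
```

Collecting every mismatch, rather than stopping at the first, makes it easier to see whether a failure is a single edge case or a whole rule behaving differently between modes.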

Minor compatibility-breaking changes, with justification:

- Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
- Removed stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
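
For code that cannot switch every call site at once, one possible stopgap is a thin wrapper that forwards a single-argument `stem_word()` call to `stem()` and emits a deprecation warning. The sketch below is an assumption-laden illustration only: `StemWordShim` and `DummyStemmer` are hypothetical names that are not part of NLTK, and the shim works with any object exposing a `stem(word)` method.

```python
import warnings


class StemWordShim:
    """Hypothetical wrapper (not part of NLTK) around any object with stem()."""

    def __init__(self, stemmer):
        self._stemmer = stemmer

    def stem(self, word):
        return self._stemmer.stem(word)

    def stem_word(self, word):
        # Legacy entry point: warn, then delegate to stem().
        warnings.warn(
            "stem_word() has been removed; call stem() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._stemmer.stem(word)


class DummyStemmer:
    # Stand-in used only to exercise the shim; it does no real stemming.
    def stem(self, word):
        return word.rstrip("s")


shim = StemWordShim(DummyStemmer())
print(shim.stem("cats"))       # preferred call
print(shim.stem_word("cats"))  # legacy call, also emits a DeprecationWarning
```

Replacing the calls directly, as suggested above, is the simpler fix; a wrapper like this only makes sense as a temporary bridge.
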
ExplodingCabbage committed Jan 23, 2016
1 parent 57511eb commit 6ad0304
Showing 7 changed files with 94,797 additions and 587 deletions.
8 changes: 8 additions & 0 deletions AUTHORS.md
@@ -189,3 +189,11 @@
- Sergio Oller
- Will Monroe
- Elijah Rippeth

## Others whose work we've taken and included in NLTK, but who didn't directly contribute it:
### Contributors to the Porter Stemmer
- Martin Porter
- Vivake Gupta
- Barry Wilkins
- Hiranmay Ghosh
- Chris Emerson
