Rewrite porter.py · ExplodingCabbage/nltk@83f01ae

Rewrite porter.py

- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version.

  This point resolves nltk#139
- As a consequence, ensured all NLTK-specific departures from the algorithm are clearly marked (this wasn't previously the case)
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
- Fixed the demo
- Made code at least roughly comply with PEP 8
- Purged copyright notice wrongly attributing authorship
- Moved comments about contributors into the contributors file, where they better belong
- Made function names more verbose and algorithm details simpler, with the aim of improving readability
- Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
- Removed a load of commented-out code that, I guess, came from whatever source porter.py was originally copied into NLTK from
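The vocabulary-based testing described above boils down to stemming every word in a list and comparing against a parallel list of expected stems. A minimal sketch of that comparison follows; `find_mismatches` and `toy_stem` are hypothetical names for illustration and are not the test code in this commit. Any callable that maps a word to its stem can be passed in.

```python
def find_mismatches(stem, words, expected_stems):
    """Return (word, expected, actual) for every word whose stem differs.

    `stem` is any callable taking a word and returning its stem.
    `words` and `expected_stems` are parallel lists, e.g. the lines of a
    vocabulary file and of the matching expected-output file.
    """
    if len(words) != len(expected_stems):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, expected in zip(words, expected_stems):
        actual = stem(word)
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches


def toy_stem(word):
    """Stand-in stemmer (NOT the Porter algorithm): strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# The toy stemmer wrongly strips the 's' from "bus", so that word is reported.
print(find_mismatches(toy_stem, ["cats", "dogs", "bus"], ["cat", "dog", "bus"]))
# → [('bus', 'bus', 'bu')]
```

Reporting the mismatching triples, rather than a bare pass/fail, makes it easier to see which rule of the algorithm a regression touches.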

Minor compatibility-breaking changes, with justification:

- Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
- Removed stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
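
For a codebase that can't update every stem_word() call site at once, one option is a small shim that forwards to stem() and warns. The sketch below is an assumption-laden illustration, not part of this commit: `StemWordCompatMixin`, `ToyStemmer`, and `CompatToyStemmer` are hypothetical names, the stand-in stemmer is not the Porter algorithm, and the shim accepts only the word itself.

```python
import warnings


class StemWordCompatMixin:
    """Hypothetical shim: provides stem_word() by delegating to stem()."""

    def stem_word(self, word):
        warnings.warn(
            "stem_word() has been removed; call stem() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.stem(word)


class ToyStemmer:
    """Stand-in for a real stemmer class; only strips one trailing 's'."""

    def stem(self, word):
        return word[:-1] if word.endswith("s") else word


class CompatToyStemmer(StemWordCompatMixin, ToyStemmer):
    """Stemmer with the old method name available during a migration."""


stemmer = CompatToyStemmer()
# Old call sites keep working (with a DeprecationWarning) ...
old_style = stemmer.stem_word("cats")
# ... while new code calls stem() directly.
new_style = stemmer.stem("cats")
```

In practice the mixin would be combined with the real stemmer class in place of `ToyStemmer`, and dropped once the call sites have been updated.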

ExplodingCabbage committed Jan 22, 2016

1 parent 57511eb commit 83f01ae

AUTHORS.md

@@ -189,3 +189,11 @@
     - Sergio Oller
     - Will Monroe
     - Elijah Rippeth
+    ## Others whose work we've taken and included in NLTK, but who didn't directly contribute it:
+    ### Contributors to the Porter Stemmer
+    - Martin Porter
+    - Vivake Gupta
+    - Barry Wilkins
+    - Hiranmay Ghosh
+    - Chris Emerson
