Commit
- Added three modes: one faithful to the original paper, one to Martin Porter's extended version, and one to NLTK's extended version. This point resolves nltk#139.
- As a consequence, ensured all NLTK-specific departures are clearly marked (this wasn't previously true).
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site. Expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
- Fixed the demo.
- Made the code at least roughly comply with PEP 8.
- Purged a copyright notice that wrongly attributed authorship.
- Moved comments about contributors into the contributors file, where they better belong.
- Made function names more verbose and algorithm details simpler, with the aim of improving readability.
- Documented the steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt.
- Removed a load of commented-out code that I guess had been carried over from whatever source porter.py was originally copied into NLTK from.

Minor compatibility-breaking changes, with justification:

- Removed the call to _adjust_case prior to stemming. This isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong.
- Removed stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
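The vocabulary-based tests described above amount to running a stemmer over a word list and comparing each result with a known-good output list. A minimal sketch of that comparison follows. It is not the test code from this commit: find_stem_mismatches and toy_stem are hypothetical names, and toy_stem is a deliberately naive stand-in rather than the Porter algorithm, used only so the example runs without NLTK or the vocabulary files.

```python
def find_stem_mismatches(stem, words, expected):
    """Return (word, expected_stem, actual_stem) for every disagreement.

    `stem` is any callable mapping a word to its stem. `words` and
    `expected` are parallel sequences, e.g. the lines of a vocabulary
    file and the lines of a matching expected-output file.
    """
    words = list(words)
    expected = list(expected)
    if len(words) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")

    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def toy_stem(word):
    # Hypothetical stand-in, NOT the Porter algorithm:
    # it only strips a single trailing "s".
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


# Tiny made-up vocabulary; a real test would load the full word list.
vocabulary = ["cats", "dog", "glass"]
expected_output = ["cat", "dog", "glass"]

mismatches = find_stem_mismatches(toy_stem, vocabulary, expected_output)
for word, want, got in mismatches:
    print(f"{word}: expected {want!r}, got {got!r}")
```

In a real test, the stem argument would presumably be the stem() method of a stemmer instance in one of the three modes, and an empty mismatch list would mean that mode reproduces its reference output.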