Rewrite porter.py #1261

Merged
merged 1 commit into from
Sep 10, 2016

Commits on Sep 10, 2016

  1. Rewrite porter.py

    - Added three modes: one faithful to the original paper, one to Martin Porter's extended version, and one to NLTK's extended version.
    
      This point resolves nltk#139
    - As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true)
    - Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
    - Fixed the demo
    - Made code at least roughly comply with PEP 8
    - Purged copyright notice wrongly attributing authorship
    - Moved comments about contributors into the contributors file, where they better belong
    - Made function names more verbose and algorithm details simpler, with the aim of improving readability
    - Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
    - Removed a load of commented-out code that I guess was carried over from whatever source porter.py was originally copied from when it was added to NLTK
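The vocabulary-based testing described above can be sketched roughly as follows. This is a minimal illustration, not the test code from this commit: `check_stemmer`, `read_words`, and `nltk_porter_stemmers` are hypothetical helper names. The three mode constants are written as they appear in released NLTK; the commit message itself does not name them, so treat them as an assumption.

```python
def check_stemmer(stem, vocabulary, expected):
    """Return (word, got, want) triples for every word that stems unexpectedly.

    `stem` is any callable mapping a word to its stem.
    """
    if len(vocabulary) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, want in zip(vocabulary, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, got, want))
    return mismatches


def read_words(path):
    """Read one word per line from a UTF-8 text file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def nltk_porter_stemmers():
    """Build one stem callable per mode.

    Requires NLTK, which is imported lazily so the rest of this sketch
    runs without it. The constant names are an assumption based on
    released NLTK, not on the commit message.
    """
    from nltk.stem.porter import PorterStemmer

    return {
        mode: PorterStemmer(mode=mode).stem
        for mode in (
            PorterStemmer.ORIGINAL_ALGORITHM,
            PorterStemmer.MARTIN_EXTENSIONS,
            PorterStemmer.NLTK_EXTENSIONS,
        )
    }


if __name__ == "__main__":
    # Toy stand-in stemmer so the example runs without NLTK or any files.
    toy_stem = lambda word: word.rstrip("s")
    print(check_stemmer(toy_stem, ["cats", "dogs"], ["cat", "dog"]))
```

With a local copy of voc.txt and a matching expected-output file for each mode, the word lists could be loaded with `read_words` and each callable from `nltk_porter_stemmers()` passed to `check_stemmer` in turn. An empty result means the mode reproduces its reference output.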
    
    Minor compatibility-breaking changes, with justification:
    
    - Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
    - Removed stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
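For code that cannot change its call sites straight away, one possible stopgap is a thin wrapper that restores the old method name. This is a hedged sketch only: `StemWordShim` is a hypothetical class, not part of NLTK, and it assumes nothing about the wrapped object beyond it having a `stem()` method.

```python
class StemWordShim:
    """Wrap any object that has stem() and expose the old stem_word() name."""

    def __init__(self, stemmer):
        self._stemmer = stemmer

    def stem(self, word):
        # Forward to the wrapped stemmer unchanged.
        return self._stemmer.stem(word)

    def stem_word(self, word):
        # Delegate to stem(); per the commit message the two methods did
        # "pretty much the same", so exact equivalence is not guaranteed.
        return self.stem(word)
```

Wrapping a stemmer instance as `StemWordShim(stemmer)` keeps existing `stem_word(...)` calls working while they are migrated to `stem(...)`, which remains the simpler long-term fix.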
    ExplodingCabbage committed Sep 10, 2016
    Commit d8402e3