New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite porter.py #1261
Rewrite porter.py #1261
Conversation
83f01ae
to
6ad0304
Compare
+1 Kudos! It looks much better now with the 3 different implementations/extensions. |
Quick query since this is now being looked at: was I right to commit the >90000 words of test cases straight into the repo, or should I have put that stuff somewhere else where it could be fetched by the downloader in order to minimise repo size? I didn't look into doing things the latter way; if I should've done, and somebody would like to point be towards docs on how to do so, I'll be happy to comply and rebase. |
I think it's best to create a separate But i'm not the best person to know how this should work. I'm also trying to push some character sets and word lists to test tokenizers and lemmatizers. But I don't think putting them in the core code or test code is appropriate. I would like to know what works best too. |
@alvations is right, and if it's just a word list we can use the existing corpus reader for that, and I can deal with it myself |
To clarify, do you mean that I should do nothing and you will add this stuff to the data repository and rebase my PR yourself to remove the data, or would you like me to take some action? |
@ExplodingCabbage I've added new classes to read my word/character lists like this https://github.com/nltk/nltk/pull/1282/files but I'm still not sure whether I'm doing it correctly. I think @stevenbird will change the corpora readers to fit the |
@ExplodingCabbage – sorry for the delay. I guess I need to migrate the test files into the NLTK corpus collection before we go any further. I hope to get onto that very soon. |
No worries, @stevenbird, life happens and so does prioritisation. |
@ExplodingCabbage: would you please point me at the test data so I can add it to the NLTK data collection? |
@ExplodingCabbage: sorry for the delay – the files are now in NLTK Data. Get them with
|
Roger. Will get this PR ready for merge tonight. |
6ad0304
to
ca3b475
Compare
- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version. This point resolves nltk#139 - As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true) - Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary. - Fixed the demo - Made code at least roughly comply with PEP 8 - Purged copyright notice wrongly attributing authorship - Moved comments about contributors into the contributors file, where they better belong - Made function names more verbose and algorithm details more simple with the aim of improving readability - Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt - Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from Minor compatability-breaking changes, with justification: - Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong - Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
ca3b475
to
d8402e3
Compare
Alright, I've rewritten the commit to obliterate the testing data. When I hackily run the tests by pasting the test class into my Python (2 or 3) shell and running the I think this is ready for review and merge? |
Thanks @ExplodingCabbage, this is great. I'll merge and see how the testing goes on our CI server. |
The commit message sums up most of what I've done here.
Long ago when I was starting out at programming I contributed some pull requests relating to NLTK's version of the Porter stemmer, but did a bad job of it. This is a more thorough attack that eliminates the remaining problems. In particular, it fulfils @stevenbird's request for a more faithful stemmer. (Although this is complicated! See the documentation about modes in the proposed patch.)
Some particular points of concern for reviewers to look at:
unittest
works, had to hack about a bit at the interpreter to run mine. You might want to make sure that they work when run as part of the whole suite and that there's nothing I should've done to add my new tests to the suite that I've missedPorterStemmer.NLTK_EXTENSIONS
as the default mode. If you care about getting people to use a version of the stemmer consistent with Martin's reference implementations more than you care about backwards compatibility, you might want to usePorterStemmer.MARTIN_EXTENSIONS
here instead. I leave it up to you.