Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite porter.py #1261

Merged
merged 1 commit into from Sep 10, 2016
Merged

Rewrite porter.py #1261

merged 1 commit into from Sep 10, 2016

Conversation

ExplodingCabbage
Copy link
Contributor

The commit message sums up most of what I've done here.

Long ago when I was starting out at programming I contributed some pull requests relating to NLTK's version of the Porter stemmer, but did a bad job of it. This is a more thorough attack that eliminates the remaining problems. In particular, it fulfils @stevenbird's request for a more faithful stemmer. (Although this is complicated! See the documentation about modes in the proposed patch.)

Some particular points of concern for reviewers to look at:

  • I couldn't get the full test suite to run and, not really knowing how unittest works, had to hack about a bit at the interpreter to run mine. You might want to make sure that they work when run as part of the whole suite and that there's nothing I should've done to add my new tests to the suite that I've missed
  • I don't know how docstrings get converted into docs pages at http://www.nltk.org/index.html, so I've just tried to make them as plaintext-readable as possible. They might need tidying up to make them web-ready.
  • I've chosen, to preserve backwards-compatibility with the current version of NLTK as much as possible, to use PorterStemmer.NLTK_EXTENSIONS as the default mode. If you care about getting people to use a version of the stemmer consistent with Martin's reference implementations more than you care about backwards compatibility, you might want to use PorterStemmer.MARTIN_EXTENSIONS here instead. I leave it up to you.

@alvations
Copy link
Contributor

+1 Kudos! It looks much better now with the 3 different implementations/extensions.

@ExplodingCabbage
Copy link
Contributor Author

Quick query since this is now being looked at: was I right to commit the >90000 words of test cases straight into the repo, or should I have put that stuff somewhere else where it could be fetched by the downloader in order to minimise repo size? I didn't look into doing things the latter way; if I should've done, and somebody would like to point be towards docs on how to do so, I'll be happy to comply and rebase.

@alvations
Copy link
Contributor

I think it's best to create a separate CorpusReader object and push it to nltk_data, then load the dataset in the test.

But i'm not the best person to know how this should work. I'm also trying to push some character sets and word lists to test tokenizers and lemmatizers. But I don't think putting them in the core code or test code is appropriate. I would like to know what works best too.

@stevenbird
Copy link
Member

@alvations is right, and if it's just a word list we can use the existing corpus reader for that, and I can deal with it myself

@ExplodingCabbage
Copy link
Contributor Author

I can deal with it myself

To clarify, do you mean that I should do nothing and you will add this stuff to the data repository and rebase my PR yourself to remove the data, or would you like me to take some action?

@alvations
Copy link
Contributor

@ExplodingCabbage I've added new classes to read my word/character lists like this https://github.com/nltk/nltk/pull/1282/files but I'm still not sure whether I'm doing it correctly.

I think @stevenbird will change the corpora readers to fit the nltk.corpus module when you merge the list after creating something like https://github.com/alvations/nltk/blob/develop/nltk/corpus/__init__.py#L280 and https://github.com/alvations/nltk/blob/develop/nltk/corpus/reader/__init__.py#L145 and https://github.com/alvations/nltk/blob/develop/nltk/corpus/reader/wordlist.py#L41

@stevenbird stevenbird self-assigned this Feb 17, 2016
@stevenbird
Copy link
Member

@ExplodingCabbage – sorry for the delay. I guess I need to migrate the test files into the NLTK corpus collection before we go any further. I hope to get onto that very soon.

@ExplodingCabbage
Copy link
Contributor Author

No worries, @stevenbird, life happens and so does prioritisation.

@stevenbird
Copy link
Member

@ExplodingCabbage: would you please point me at the test data so I can add it to the NLTK data collection?

stevenbird added a commit to nltk/nltk_data that referenced this pull request Aug 22, 2016
@stevenbird
Copy link
Member

@ExplodingCabbage: sorry for the delay – the files are now in NLTK Data. Get them with sudo python -m nltk.downloader porter_test. And access them using nltk.data.find, e.g.:

from nltk import data
f = data.find('stemmers/porter_test/porter_nltk_output.txt').open()
f.read().split()
['a', 'aaron', 'abaissiez', 'abandon', 'abandon', 'abas', 'abash', 'abat', 'abat', 'abat', ...]

@ExplodingCabbage
Copy link
Contributor Author

Roger. Will get this PR ready for merge tonight.

@ExplodingCabbage
Copy link
Contributor Author

ExplodingCabbage commented Sep 4, 2016

I've rebased 6ad0304 on top of the current develop, creating ca3b475. Will now modify the commit again to remove the test data and use nltk.data.find. Once that's done, this should be ready to merge.

- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version.

  This point resolves nltk#139
- As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true)
- Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary.
- Fixed the demo
- Made code at least roughly comply with PEP 8
- Purged copyright notice wrongly attributing authorship
- Moved comments about contributors into the contributors file, where they better belong
- Made function names more verbose and algorithm details more simple with the aim of improving readability
- Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt
- Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from

Minor compatability-breaking changes, with justification:

- Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong
- Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.
@ExplodingCabbage
Copy link
Contributor Author

Alright, I've rewritten the commit to obliterate the testing data. When I hackily run the tests by pasting the test class into my Python (2 or 3) shell and running the test_foo methods individually, they pass. I still wasn't able to run the full test suite, unfortunately.

I think this is ready for review and merge?

@stevenbird
Copy link
Member

Thanks @ExplodingCabbage, this is great. I'll merge and see how the testing goes on our CI server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants