Rewrite porter.py #1261

ExplodingCabbage · 2016-01-22T00:54:13Z

The commit message sums up most of what I've done here.

Long ago when I was starting out at programming I contributed some pull requests relating to NLTK's version of the Porter stemmer, but did a bad job of it. This is a more thorough attack that eliminates the remaining problems. In particular, it fulfils @stevenbird's request for a more faithful stemmer. (Although this is complicated! See the documentation about modes in the proposed patch.)

Some particular points of concern for reviewers to look at:

I couldn't get the full test suite to run and, not really knowing how unittest works, had to hack about a bit at the interpreter to run mine. You might want to make sure that they work when run as part of the whole suite and that there's nothing I should've done to add my new tests to the suite that I've missed
I don't know how docstrings get converted into docs pages at http://www.nltk.org/index.html, so I've just tried to make them as plaintext-readable as possible. They might need tidying up to make them web-ready.
I've chosen, to preserve backwards-compatibility with the current version of NLTK as much as possible, to use PorterStemmer.NLTK_EXTENSIONS as the default mode. If you care about getting people to use a version of the stemmer consistent with Martin's reference implementations more than you care about backwards compatibility, you might want to use PorterStemmer.MARTIN_EXTENSIONS here instead. I leave it up to you.

alvations · 2016-02-15T10:02:48Z

+1 Kudos! It looks much better now with the 3 different implementations/extensions.

ExplodingCabbage · 2016-02-15T10:29:01Z

Quick query since this is now being looked at: was I right to commit the >90000 words of test cases straight into the repo, or should I have put that stuff somewhere else where it could be fetched by the downloader in order to minimise repo size? I didn't look into doing things the latter way; if I should've done, and somebody would like to point be towards docs on how to do so, I'll be happy to comply and rebase.

alvations · 2016-02-15T10:35:28Z

I think it's best to create a separate CorpusReader object and push it to nltk_data, then load the dataset in the test.

But i'm not the best person to know how this should work. I'm also trying to push some character sets and word lists to test tokenizers and lemmatizers. But I don't think putting them in the core code or test code is appropriate. I would like to know what works best too.

stevenbird · 2016-02-15T17:19:34Z

@alvations is right, and if it's just a word list we can use the existing corpus reader for that, and I can deal with it myself

ExplodingCabbage · 2016-02-15T17:33:08Z

I can deal with it myself

To clarify, do you mean that I should do nothing and you will add this stuff to the data repository and rebase my PR yourself to remove the data, or would you like me to take some action?

alvations · 2016-02-16T14:25:18Z

@ExplodingCabbage I've added new classes to read my word/character lists like this https://github.com/nltk/nltk/pull/1282/files but I'm still not sure whether I'm doing it correctly.

I think @stevenbird will change the corpora readers to fit the nltk.corpus module when you merge the list after creating something like https://github.com/alvations/nltk/blob/develop/nltk/corpus/__init__.py#L280 and https://github.com/alvations/nltk/blob/develop/nltk/corpus/reader/__init__.py#L145 and https://github.com/alvations/nltk/blob/develop/nltk/corpus/reader/wordlist.py#L41

stevenbird · 2016-06-20T15:00:30Z

@ExplodingCabbage – sorry for the delay. I guess I need to migrate the test files into the NLTK corpus collection before we go any further. I hope to get onto that very soon.

ExplodingCabbage · 2016-06-20T15:28:22Z

No worries, @stevenbird, life happens and so does prioritisation.

stevenbird · 2016-07-13T23:27:01Z

@ExplodingCabbage: would you please point me at the test data so I can add it to the NLTK data collection?

ExplodingCabbage · 2016-07-14T08:00:46Z

@stevenbird:

stevenbird · 2016-08-22T23:58:33Z

@ExplodingCabbage: sorry for the delay – the files are now in NLTK Data. Get them with sudo python -m nltk.downloader porter_test. And access them using nltk.data.find, e.g.:

from nltk import data
f = data.find('stemmers/porter_test/porter_nltk_output.txt').open()
f.read().split()
['a', 'aaron', 'abaissiez', 'abandon', 'abandon', 'abas', 'abash', 'abat', 'abat', 'abat', ...]

ExplodingCabbage · 2016-08-23T09:05:46Z

Roger. Will get this PR ready for merge tonight.

ExplodingCabbage · 2016-09-04T20:52:50Z

I've rebased 6ad0304 on top of the current develop, creating ca3b475. Will now modify the commit again to remove the test data and use nltk.data.find. Once that's done, this should be ready to merge.

- Added three modes: one faithful to the original paper, one to M Porter's extended version, and one to NLTK's extended version. This point resolves nltk#139 - As a consequence, ensured all NLTK-specific departures were clearly marked (wasn't previously true) - Added unit tests. These use Martin Porter's recommended testing vocabulary from http://tartarus.org/martin/PorterStemmer/voc.txt. Expected output for the Martin-extended version of the stemmer is likewise taken from his site; expected output for the original and NLTK versions was generated by running, respectively, the C version of the stemmer written by Martin and the NLTK version of the stemmer prior to this commit against the testing vocabulary. - Fixed the demo - Made code at least roughly comply with PEP 8 - Purged copyright notice wrongly attributing authorship - Moved comments about contributors into the contributors file, where they better belong - Made function names more verbose and algorithm details more simple with the aim of improving readability - Documented steps in the algorithm with quotes from the original paper by Martin Porter, currently hosted at http://tartarus.org/martin/PorterStemmer/def.txt - Removed a load of commented-out code that I guess had originally been taken from whatever source porter.py was originally copied into NLTK from Minor compatability-breaking changes, with justification: - Removed call to _adjust_case prior to stemming; this isn't part of the Porter algorithm, and isn't done by other NLTK stemmers like Lancaster or Snowball, so it seemed wrong - Remove stem_word from PorterStemmer. It does pretty much the same as stem(), and isn't part of the StemmerI interface. Anybody who was previously using it (hopefully nobody) can just change their code to call stem() if they update NLTK.

ExplodingCabbage · 2016-09-10T21:15:43Z

Alright, I've rewritten the commit to obliterate the testing data. When I hackily run the tests by pasting the test class into my Python (2 or 3) shell and running the test_foo methods individually, they pass. I still wasn't able to run the full test suite, unfortunately.

I think this is ready for review and merge?

stevenbird · 2016-09-10T23:36:36Z

Thanks @ExplodingCabbage, this is great. I'll merge and see how the testing goes on our CI server.

ExplodingCabbage force-pushed the PorterRewrite branch from 83f01ae to 6ad0304 Compare January 23, 2016 21:50

alvations added the enhancement label Feb 15, 2016

stevenbird self-assigned this Feb 17, 2016

stevenbird added a commit to nltk/nltk_data that referenced this pull request Aug 22, 2016

added Porter Stemmer test files, cf nltk/nltk#1261

300619a

ExplodingCabbage force-pushed the PorterRewrite branch from 6ad0304 to ca3b475 Compare September 4, 2016 20:52

ExplodingCabbage mentioned this pull request Sep 10, 2016

Better document how to run the tests #1466

Closed

ExplodingCabbage force-pushed the PorterRewrite branch from ca3b475 to d8402e3 Compare September 10, 2016 21:13

stevenbird merged commit 4abdf67 into nltk:develop Sep 10, 2016

ExplodingCabbage deleted the PorterRewrite branch September 12, 2016 12:36

ExplodingCabbage mentioned this pull request Jan 7, 2017

Fix Porter stemmer failing on 'oed' #1582

Merged

Diyago mentioned this pull request May 21, 2017

“IndexError: string index out of range” with nltk library - to delete #1732

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite porter.py #1261

Rewrite porter.py #1261

ExplodingCabbage commented Jan 22, 2016

alvations commented Feb 15, 2016

ExplodingCabbage commented Feb 15, 2016

alvations commented Feb 15, 2016

stevenbird commented Feb 15, 2016

ExplodingCabbage commented Feb 15, 2016

alvations commented Feb 16, 2016

stevenbird commented Jun 20, 2016

ExplodingCabbage commented Jun 20, 2016

stevenbird commented Jul 13, 2016

ExplodingCabbage commented Jul 14, 2016

stevenbird commented Aug 22, 2016

ExplodingCabbage commented Aug 23, 2016

ExplodingCabbage commented Sep 4, 2016 •

edited

ExplodingCabbage commented Sep 10, 2016

stevenbird commented Sep 10, 2016

Rewrite porter.py #1261

Rewrite porter.py #1261

Conversation

ExplodingCabbage commented Jan 22, 2016

alvations commented Feb 15, 2016

ExplodingCabbage commented Feb 15, 2016

alvations commented Feb 15, 2016

stevenbird commented Feb 15, 2016

ExplodingCabbage commented Feb 15, 2016

alvations commented Feb 16, 2016

stevenbird commented Jun 20, 2016

ExplodingCabbage commented Jun 20, 2016

stevenbird commented Jul 13, 2016

ExplodingCabbage commented Jul 14, 2016

stevenbird commented Aug 22, 2016

ExplodingCabbage commented Aug 23, 2016

ExplodingCabbage commented Sep 4, 2016 • edited

ExplodingCabbage commented Sep 10, 2016

stevenbird commented Sep 10, 2016

ExplodingCabbage commented Sep 4, 2016 •

edited