A slew of improvements to the Porter Stemmer #303

ExplodingCabbage · 2012-10-28T23:47:12Z

Hi,

I'm new to GitHub so sorry in advance if I'm doing anything wrong or improper here. Drop me a message if so and I'll fix it.

I've made some changes to the Porter Stemmer. The original version of it was ported from C and so there was lots of weirdness going on. Basically, as detailed in the commit message, my changes were:

Making it threadsafe
Making it faster (~ 40%)
Making it cleaner / more Pythonic

I wrote up a consistency test and a performance test to confirm that this version spits out the same results and is faster, and ran this against the complete text of the King James Bible (804875 words). It passed the consistency test (i.e. each word stemmed to the same stem using either my improved stemmer or the old one), and on my laptop the old stemmer takes 15.9 seconds while the new stemmer takes 9.6 seconds.

Would be very cool if you'd pull my changes. :)

Regards,

Cabbage

This was to: 1) Make it threadsafe 2) Make it more Pythonic and remove at least some of the weird C-like code from when it was originally ported from C 3) Remove nonsense comments referring to things like null terminated strings and char arrays that are meaningless in a Python context (again the result of it originally being ported from C) 4) Make it faster. A test on the complete text of the King James Bible indicates that post-changes the stemmer has entirely consistent results with the previous version and is roughly 40% faster. It is also now threadsafe since nothing is stored in any instance variables any more; instead things that used to be instance variables are now passed between functions as arguments.

desilinguist · 2012-10-29T00:01:35Z

How does this compare to the Porter2 stemmer which is part of the Snowball multilingual stemmer package (http://snowball.tartarus.org/algorithms/english/stemmer.html) and supposed to be an improvement over the original Porter stemmer? It already has a python wrapper as well (http://snowball.tartarus.org/wrappers/guide.html)

Nitin

On Oct 28, 2012, at 7:47 PM, ExplodingCabbage notifications@github.com wrote:

Hi,

I'm new to GitHub so sorry in advance if I'm doing anything wrong or improper here. Drop me a message if so and I'll fix it.

I've made some changes to the Porter Stemmer. The original version of it was ported from C and so there was lots of weirdness going on. Basically, as detailed in the commit message, my changes were:

Making it threadsafe

Making it faster (~ 40%)

Making it cleaner / more Pythonic

I wrote up a consistency test and a performance test to confirm that this version spits out the same results and is faster, and ran this against the complete text of the King James Bible (804875 words). It passed the consistency test (i.e. each word stemmed to the same stem either my improved stemmer or the old one), and on my laptop the old stemmer takes 15.9 seconds while the new stemmer takes 9.6 seconds.

Would be very cool if you'd pull my changes. :)

Regards,

Cabbage

You can merge this Pull Request by running:

git pull https://github.com/ExplodingCabbage/nltk master
Or view, comment on, or merge it at:

#303

Commit Summary

Substantial rewrite of the Porter Stemmer.
Merge branch 'master' of https://github.com/nltk/nltk
File Changes

M nltk/stem/porter.py (556)
Patch Links

https://github.com/nltk/nltk/pull/303.patch
https://github.com/nltk/nltk/pull/303.diff
—
Reply to this email directly or view it on GitHub.

kmike · 2012-11-07T11:58:26Z

nltk/stem/porter.py

@@ -170,19 +162,21 @@ def __init__(self):
        for key in irregular_forms:
            for val in irregular_forms[key]:
                self.pool[val] = key
+
+        self.vowels = {'a', 'e', 'i', 'o', 'u'}


This is Python 2.7-only syntax; something like that would be Python 2.6-compatible:

self.vowels = set(['a', 'e', 'i', 'o', 'u'])

or even simpler, set('aeiou')

ExplodingCabbage · 2012-11-18T16:59:56Z

Thanks for the feedback and sorry it's taken me some time to get round to responding. I'll change the set syntax highlighted by kmike to be compatible with older Python versions, and I'll test the stemmer against Snowball's version and report back.

ExplodingCabbage · 2012-11-18T18:51:11Z

Alright, PyStemmer is much faster than even my tweaked version of the nltk one (probably not surprising since the logic is handled in C whereas we were handling the expensive stuff in Python). Even with the built-in caching feature turned off, PyStemmer gets through the King James Bible over 5 times faster than my version does - making it almost 10x faster than nltk's current PorterStemmer.

It also doesn't give the same results in all cases. There were 300 distinct words in the KJB on which the two stemmers disagreed. Disregarding proper nouns, of which there were many, the majority of the disagreements (by eye, roughly 90%) were down to two key differences:

PyStemmer doesn't stem trailing 's'es on words if the letter before the 's' is a 'u', but we do. PyStemmer's approach is better, since we, for example, stem 'virus' to 'viru' but 'viruses' to 'virus', and we stem 'fetus' to 'fetu' but 'fetuses' to 'fetus'; PyStemmer stems these to 'virus', 'virus', 'fetus' and 'fetus' respectively, which is more correct/consistent.
PyStemmer removes trailing 'ly' from words, whereas we (always? usually? not sure precisely which rules come into play here off the top of my head) convert it to 'li'. This is quite a huge improvement; it lets PyStemmer notice that adverbs have the same root as their corresponding adjective, which we are currently pretty hopeless at. For example, we stem 'motherly', 'mother', 'honestly' and 'honest' to 'motherli', 'mother', 'honestli' and 'honest', whereas PyStemmer stems them to 'mother', 'mother', 'honest' and 'honest'.

There are other inconsistencies unrelated to these two major rule differences (e.g. we stem 'his' to 'hi' whereas PyStemmer stems it to 'his'; we stem 'general' to 'gener' whereas PyStemmer stems it to 'general') but not many and none of them stood out as being bad marks for PyStemmer. I think it's fair to say it is better than nltk's implementation both in terms of speed and results.

With these observations in mind, it seems to me that the best result for nltk would be to integrate PyStemmer or otherwise wrap the Snowball stemmer in some way.

Anyway, I'd be happy to help out with this but am unsure what the best approach is. I await further advice/instructions from someone more knowledgeable and experienced in this stuff than me.

stevenbird · 2012-12-03T10:51:21Z

Thanks for the contribution and for the careful analysis. The differences in output make things slightly complicated. If there was a comprehensive test set and we could demonstrate that the rewritten stemmer is (more) correct than the existing one, it would be fine to do the replacement. Does anyone want to go to the trouble of doing that?

Otherwise, I think the right thing is to keep both versions around, and not risk breaking anyone's code that crucially depends on the existing version. It could go in nltk.stem.porter2.PorterStemmer.

desilinguist · 2012-12-03T12:03:14Z

I agree. It might actually be good to keep the older version around in case people need it for replicability.

Nitin

On Dec 3, 2012, at 5:51 AM, Steven Bird notifications@github.com wrote:

Thanks for the contribution and for the careful analysis. The differences in output make things slightly complicated. If there was a comprehensive test set and we could demonstrate that the rewritten stemmer is (more) correct than the existing one, it would be fine to do the replacement. Does anyone want to go to the trouble of doing that?

Otherwise, I think the right thing is to keep both versions around, and not risk breaking anyone's code that crucially depends on the existing version. It could go in nltk.stem.porter2.PorterStemmer.

—
Reply to this email directly or view it on GitHub.

kmike · 2012-12-03T21:44:13Z

related: #139

stevenbird · 2012-12-11T11:16:47Z

So, does anyone know if the rewritten stemmer is more correct than the existing one, considering the reference implementation (#139)? If so, we have an argument for adopting the rewritten stemmer in place of the existing one.

ExplodingCabbage · 2012-12-11T13:06:32Z

Hey. It's unsurprising to me that the existing stemmer returns results different from a strict implementation of the Porter stemming algorithm (which, if I'm reading correctly, is the 'bug' being reported in that issue report?); there are multiple overt departures from the original algorithm in the current nltk implementation, which are all marked with big all-caps comments saying ## --DEPARTURE--

I don't know for certain off the top of my head that all the inconsistencies that the issue report highlights are linked to these departure points, but I strongly suspect that they are.

Anyway, the rewritten stemmer I submitted is consistent with the existing nltk stemmer (meaning it still has the inconsistencies with a faithful Porter implementation that the existing nltk version has). Once the compatibility issue with old versions of Python are fixed (which is a one-line change), this means that what I've submitted is strictly and unambiguously an improvement over nltk's current Porter stemmer implementation (since the rewrite returns the same results as the current implementation, but faster and thread-safely). If you'd like, I'll change that set-literal to Python 2.6-compatible syntax tonight and push the change, and you can accept that.

However, there are two other courses of action that one could argue would be even better:

Tweak the rewrite to make it faithful to the original Porter stemming algorithm, removing all the points of departure.
Use the Snowball stemmer, as already discussed.

But since option # 2 seems to have already been voted against on the grounds that we don't want to create inconsistencies with people's existing data created using the existing version, I would think that we would want to avoid option # 1 for the same reason. I don't really feel entitled to make decisions about what direction to take the module in, though, so I'm just putting out there what I see as the three options (a. use the rewrite, b. use Pystemmer/Snowball, or c. tweak the rewrite to be faithful to the original Porter algorithm) and it's up to you what you pick.

With that in mind, I'll fix the 2.6-compatibility issue in the rewrite tonight and push the fix, then you guys can do with it as you see fit. :)

Regards,
Cabbage

larsmans · 2012-12-25T15:48:35Z

This patch looks surprisingly similar to the modifications I made for the Gensim version. Did you use that as a reference?

ExplodingCabbage · 2013-01-02T14:32:40Z

Just made the change to Python-2.6 syntax. Sorry that it took me so long to get around to pushing a one-liner - I am very lazy!

@larsmans No, I wasn't aware of your changes or indeed of the existence of Gensim. I probably should've done a little research and looked around for other people's Porter implementations before diving into this; that's a lesson I'll remember next time I work on anything of this nature. Naturally, feel free to use anything I've done here in the Gensim version if that'd be helpful.

stevenbird · 2013-02-11T02:13:22Z

Sorry for not responding earlier. Since this is just an efficiency improvement, not a change to the word->stem mappings, we should accept it. (I had misunderstood this at the time of my last post.)

I'd be happy to drop backwards compatibility if we had a set of changes that made NLTK's version faithful with the original Porter Stemmer.

stevenbird · 2013-02-11T03:18:07Z

Going through stem.doctest, I found an inconsistency. The existing version stemmed "traditional" to "tradit". However, with these changes, it is stemmed to "tradiat". I'll revert the changes until the situation is clarified.

ExplodingCabbage · 2013-02-11T12:15:47Z

Sounds like I badly screwed up. I'll figure out what happened tonight and fix it.

Mark Amery and others added 2 commits October 28, 2012 23:11

Merge branch 'master' of https://github.com/nltk/nltk

c1a25db

kmike reviewed Nov 7, 2012
View reviewed changes

made change to allow python 2.6-compatibility

aba6682

frozenset instead of set for vowels

cb53025

stevenbird merged commit cb53025 into nltk:master Feb 11, 2013

ExplodingCabbage mentioned this pull request Mar 2, 2013

Same improvements to the Porter stemmer as before, but without the bugs #364

Merged

turicas pushed a commit to turicas/nltk that referenced this pull request Nov 11, 2013

adopted efficiency improvements from nltk#303

cc38847

ExplodingCabbage mentioned this pull request Jan 22, 2016

Rewrite porter.py #1261

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A slew of improvements to the Porter Stemmer #303

A slew of improvements to the Porter Stemmer #303

ExplodingCabbage commented Oct 28, 2012

desilinguist commented Oct 29, 2012

kmike Nov 7, 2012

kmike Dec 13, 2012

ExplodingCabbage commented Nov 18, 2012

ExplodingCabbage commented Nov 18, 2012

stevenbird commented Dec 3, 2012

desilinguist commented Dec 3, 2012

kmike commented Dec 3, 2012

stevenbird commented Dec 11, 2012

ExplodingCabbage commented Dec 11, 2012

larsmans commented Dec 25, 2012

ExplodingCabbage commented Jan 2, 2013

stevenbird commented Feb 11, 2013

stevenbird commented Feb 11, 2013

ExplodingCabbage commented Feb 11, 2013

A slew of improvements to the Porter Stemmer #303

A slew of improvements to the Porter Stemmer #303

Conversation

ExplodingCabbage commented Oct 28, 2012

desilinguist commented Oct 29, 2012

kmike Nov 7, 2012

Choose a reason for hiding this comment

kmike Dec 13, 2012

Choose a reason for hiding this comment

ExplodingCabbage commented Nov 18, 2012

ExplodingCabbage commented Nov 18, 2012

stevenbird commented Dec 3, 2012

desilinguist commented Dec 3, 2012

kmike commented Dec 3, 2012

stevenbird commented Dec 11, 2012

ExplodingCabbage commented Dec 11, 2012

larsmans commented Dec 25, 2012

ExplodingCabbage commented Jan 2, 2013

stevenbird commented Feb 11, 2013

stevenbird commented Feb 11, 2013

ExplodingCabbage commented Feb 11, 2013