Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A slew of improvements to the Porter Stemmer #303

Merged
merged 4 commits into from Feb 11, 2013
Merged

A slew of improvements to the Porter Stemmer #303

merged 4 commits into from Feb 11, 2013

Conversation

ExplodingCabbage
Copy link
Contributor

Hi,

I'm new to GitHub so sorry in advance if I'm doing anything wrong or improper here. Drop me a message if so and I'll fix it.

I've made some changes to the Porter Stemmer. The original version of it was ported from C and so there was lots of weirdness going on. Basically, as detailed in the commit message, my changes were:

  1. Making it threadsafe
  2. Making it faster (~ 40%)
  3. Making it cleaner / more Pythonic

I wrote up a consistency test and a performance test to confirm that this version spits out the same results and is faster, and ran this against the complete text of the King James Bible (804875 words). It passed the consistency test (i.e. each word stemmed to the same stem using either my improved stemmer or the old one), and on my laptop the old stemmer takes 15.9 seconds while the new stemmer takes 9.6 seconds.

Would be very cool if you'd pull my changes. :)

Regards,

Cabbage

Mark Amery and others added 2 commits October 28, 2012 23:11
This was to:
1) Make it threadsafe
2) Make it more Pythonic and remove at least some of the weird C-like code from
   when it was originally ported from C
3) Remove nonsense comments referring to things like null terminated strings and
   char arrays that are meaningless in a Python context (again the result of it
   originally being ported from C)
4) Make it faster.

A test on the complete text of the King James Bible indicates that post-changes
the stemmer has entirely consistent results with the previous version and is
roughly 40% faster. It is also now threadsafe since nothing is stored in
any instance variables any more; instead things that used to be instance
variables are now passed between functions as arguments.
@desilinguist
Copy link
Contributor

How does this compare to the Porter2 stemmer which is part of the Snowball multilingual stemmer package (http://snowball.tartarus.org/algorithms/english/stemmer.html) and supposed to be an improvement over the original Porter stemmer? It already has a python wrapper as well (http://snowball.tartarus.org/wrappers/guide.html)

  • Nitin

On Oct 28, 2012, at 7:47 PM, ExplodingCabbage notifications@github.com wrote:

Hi,

I'm new to GitHub so sorry in advance if I'm doing anything wrong or improper here. Drop me a message if so and I'll fix it.

I've made some changes to the Porter Stemmer. The original version of it was ported from C and so there was lots of weirdness going on. Basically, as detailed in the commit message, my changes were:

  1. Making it threadsafe
  2. Making it faster (~ 40%)
  3. Making it cleaner / more Pythonic

I wrote up a consistency test and a performance test to confirm that this version spits out the same results and is faster, and ran this against the complete text of the King James Bible (804875 words). It passed the consistency test (i.e. each word stemmed to the same stem either my improved stemmer or the old one), and on my laptop the old stemmer takes 15.9 seconds while the new stemmer takes 9.6 seconds.

Would be very cool if you'd pull my changes. :)

Regards,

Cabbage

You can merge this Pull Request by running:

git pull https://github.com/ExplodingCabbage/nltk master
Or view, comment on, or merge it at:

#303

Commit Summary

Substantial rewrite of the Porter Stemmer.
Merge branch 'master' of https://github.com/nltk/nltk
File Changes

M nltk/stem/porter.py (556)
Patch Links

https://github.com/nltk/nltk/pull/303.patch
https://github.com/nltk/nltk/pull/303.diff

Reply to this email directly or view it on GitHub.

@@ -170,19 +162,21 @@ def __init__(self):
for key in irregular_forms:
for val in irregular_forms[key]:
self.pool[val] = key

self.vowels = {'a', 'e', 'i', 'o', 'u'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Python 2.7-only syntax; something like that would be Python 2.6-compatible:

self.vowels = set(['a', 'e', 'i', 'o', 'u'])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even simpler, set('aeiou')

@ExplodingCabbage
Copy link
Contributor Author

Thanks for the feedback and sorry it's taken me some time to get round to responding. I'll change the set syntax highlighted by kmike to be compatible with older Python versions, and I'll test the stemmer against Snowball's version and report back.

@ExplodingCabbage
Copy link
Contributor Author

Alright, PyStemmer is much faster than even my tweaked version of the nltk one (probably not surprising since the logic is handled in C whereas we were handling the expensive stuff in Python). Even with the built-in caching feature turned off, PyStemmer gets through the King James Bible over 5 times faster than my version does - making it almost 10x faster than nltk's current PorterStemmer.

It also doesn't give the same results in all cases. There were 300 distinct words in the KJB on which the two stemmers disagreed. Disregarding proper nouns, of which there were many, the majority of the disagreements (by eye, roughly 90%) were down to two key differences:

  1. PyStemmer doesn't stem trailing 's'es on words if the letter before the 's' is a 'u', but we do. PyStemmer's approach is better, since we, for example, stem 'virus' to 'viru' but 'viruses' to 'virus', and we stem 'fetus' to 'fetu' but 'fetuses' to 'fetus'; PyStemmer stems these to 'virus', 'virus', 'fetus' and 'fetus' respectively, which is more correct/consistent.

  2. PyStemmer removes trailing 'ly' from words, whereas we (always? usually? not sure precisely which rules come into play here off the top of my head) convert it to 'li'. This is quite a huge improvement; it lets PyStemmer notice that adverbs have the same root as their corresponding adjective, which we are currently pretty hopeless at. For example, we stem 'motherly', 'mother', 'honestly' and 'honest' to 'motherli', 'mother', 'honestli' and 'honest', whereas PyStemmer stems them to 'mother', 'mother', 'honest' and 'honest'.

There are other inconsistencies unrelated to these two major rule differences (e.g. we stem 'his' to 'hi' whereas PyStemmer stems it to 'his'; we stem 'general' to 'gener' whereas PyStemmer stems it to 'general') but not many and none of them stood out as being bad marks for PyStemmer. I think it's fair to say it is better than nltk's implementation both in terms of speed and results.

With these observations in mind, it seems to me that the best result for nltk would be to integrate PyStemmer or otherwise wrap the Snowball stemmer in some way.

Anyway, I'd be happy to help out with this but am unsure what the best approach is. I await further advice/instructions from someone more knowledgeable and experienced in this stuff than me.

@stevenbird
Copy link
Member

Thanks for the contribution and for the careful analysis. The differences in output make things slightly complicated. If there was a comprehensive test set and we could demonstrate that the rewritten stemmer is (more) correct than the existing one, it would be fine to do the replacement. Does anyone want to go to the trouble of doing that?

Otherwise, I think the right thing is to keep both versions around, and not risk breaking anyone's code that crucially depends on the existing version. It could go in nltk.stem.porter2.PorterStemmer.

@desilinguist
Copy link
Contributor

I agree. It might actually be good to keep the older version around in case people need it for replicability.

  • Nitin

On Dec 3, 2012, at 5:51 AM, Steven Bird notifications@github.com wrote:

Thanks for the contribution and for the careful analysis. The differences in output make things slightly complicated. If there was a comprehensive test set and we could demonstrate that the rewritten stemmer is (more) correct than the existing one, it would be fine to do the replacement. Does anyone want to go to the trouble of doing that?

Otherwise, I think the right thing is to keep both versions around, and not risk breaking anyone's code that crucially depends on the existing version. It could go in nltk.stem.porter2.PorterStemmer.


Reply to this email directly or view it on GitHub.

@kmike
Copy link
Member

kmike commented Dec 3, 2012

related: #139

@stevenbird
Copy link
Member

So, does anyone know if the rewritten stemmer is more correct than the existing one, considering the reference implementation (#139)? If so, we have an argument for adopting the rewritten stemmer in place of the existing one.

@ExplodingCabbage
Copy link
Contributor Author

Hey. It's unsurprising to me that the existing stemmer returns results different from a strict implementation of the Porter stemming algorithm (which, if I'm reading correctly, is the 'bug' being reported in that issue report?); there are multiple overt departures from the original algorithm in the current nltk implementation, which are all marked with big all-caps comments saying ## --DEPARTURE--

I don't know for certain off the top of my head that all the inconsistencies that the issue report highlights are linked to these departure points, but I strongly suspect that they are.

Anyway, the rewritten stemmer I submitted is consistent with the existing nltk stemmer (meaning it still has the inconsistencies with a faithful Porter implementation that the existing nltk version has). Once the compatibility issue with old versions of Python are fixed (which is a one-line change), this means that what I've submitted is strictly and unambiguously an improvement over nltk's current Porter stemmer implementation (since the rewrite returns the same results as the current implementation, but faster and thread-safely). If you'd like, I'll change that set-literal to Python 2.6-compatible syntax tonight and push the change, and you can accept that.

However, there are two other courses of action that one could argue would be even better:

  1. Tweak the rewrite to make it faithful to the original Porter stemming algorithm, removing all the points of departure.
  2. Use the Snowball stemmer, as already discussed.

But since option # 2 seems to have already been voted against on the grounds that we don't want to create inconsistencies with people's existing data created using the existing version, I would think that we would want to avoid option # 1 for the same reason. I don't really feel entitled to make decisions about what direction to take the module in, though, so I'm just putting out there what I see as the three options (a. use the rewrite, b. use Pystemmer/Snowball, or c. tweak the rewrite to be faithful to the original Porter algorithm) and it's up to you what you pick.

With that in mind, I'll fix the 2.6-compatibility issue in the rewrite tonight and push the fix, then you guys can do with it as you see fit. :)

Regards,
Cabbage

@larsmans
Copy link
Contributor

This patch looks surprisingly similar to the modifications I made for the Gensim version. Did you use that as a reference?

@ExplodingCabbage
Copy link
Contributor Author

Just made the change to Python-2.6 syntax. Sorry that it took me so long to get around to pushing a one-liner - I am very lazy!

@larsmans No, I wasn't aware of your changes or indeed of the existence of Gensim. I probably should've done a little research and looked around for other people's Porter implementations before diving into this; that's a lesson I'll remember next time I work on anything of this nature. Naturally, feel free to use anything I've done here in the Gensim version if that'd be helpful.

@stevenbird
Copy link
Member

Sorry for not responding earlier. Since this is just an efficiency improvement, not a change to the word->stem mappings, we should accept it. (I had misunderstood this at the time of my last post.)

I'd be happy to drop backwards compatibility if we had a set of changes that made NLTK's version faithful with the original Porter Stemmer.

@stevenbird stevenbird merged commit cb53025 into nltk:master Feb 11, 2013
@stevenbird
Copy link
Member

Going through stem.doctest, I found an inconsistency. The existing version stemmed "traditional" to "tradit". However, with these changes, it is stemmed to "tradiat". I'll revert the changes until the situation is clarified.

@ExplodingCabbage
Copy link
Contributor Author

Sounds like I badly screwed up. I'll figure out what happened tonight and fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants