New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A slew of improvements to the Porter Stemmer #303
Conversation
This was to: 1) Make it threadsafe 2) Make it more Pythonic and remove at least some of the weird C-like code from when it was originally ported from C 3) Remove nonsense comments referring to things like null terminated strings and char arrays that are meaningless in a Python context (again the result of it originally being ported from C) 4) Make it faster. A test on the complete text of the King James Bible indicates that post-changes the stemmer has entirely consistent results with the previous version and is roughly 40% faster. It is also now threadsafe since nothing is stored in any instance variables any more; instead things that used to be instance variables are now passed between functions as arguments.
How does this compare to the Porter2 stemmer which is part of the Snowball multilingual stemmer package (http://snowball.tartarus.org/algorithms/english/stemmer.html) and supposed to be an improvement over the original Porter stemmer? It already has a python wrapper as well (http://snowball.tartarus.org/wrappers/guide.html)
On Oct 28, 2012, at 7:47 PM, ExplodingCabbage notifications@github.com wrote:
|
@@ -170,19 +162,21 @@ def __init__(self): | |||
for key in irregular_forms: | |||
for val in irregular_forms[key]: | |||
self.pool[val] = key | |||
|
|||
self.vowels = {'a', 'e', 'i', 'o', 'u'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Python 2.7-only syntax; something like that would be Python 2.6-compatible:
self.vowels = set(['a', 'e', 'i', 'o', 'u'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or even simpler, set('aeiou')
Thanks for the feedback and sorry it's taken me some time to get round to responding. I'll change the set syntax highlighted by kmike to be compatible with older Python versions, and I'll test the stemmer against Snowball's version and report back. |
Alright, PyStemmer is much faster than even my tweaked version of the nltk one (probably not surprising since the logic is handled in C whereas we were handling the expensive stuff in Python). Even with the built-in caching feature turned off, PyStemmer gets through the King James Bible over 5 times faster than my version does - making it almost 10x faster than nltk's current PorterStemmer. It also doesn't give the same results in all cases. There were 300 distinct words in the KJB on which the two stemmers disagreed. Disregarding proper nouns, of which there were many, the majority of the disagreements (by eye, roughly 90%) were down to two key differences:
There are other inconsistencies unrelated to these two major rule differences (e.g. we stem 'his' to 'hi' whereas PyStemmer stems it to 'his'; we stem 'general' to 'gener' whereas PyStemmer stems it to 'general') but not many and none of them stood out as being bad marks for PyStemmer. I think it's fair to say it is better than nltk's implementation both in terms of speed and results. With these observations in mind, it seems to me that the best result for nltk would be to integrate PyStemmer or otherwise wrap the Snowball stemmer in some way. Anyway, I'd be happy to help out with this but am unsure what the best approach is. I await further advice/instructions from someone more knowledgeable and experienced in this stuff than me. |
Thanks for the contribution and for the careful analysis. The differences in output make things slightly complicated. If there was a comprehensive test set and we could demonstrate that the rewritten stemmer is (more) correct than the existing one, it would be fine to do the replacement. Does anyone want to go to the trouble of doing that? Otherwise, I think the right thing is to keep both versions around, and not risk breaking anyone's code that crucially depends on the existing version. It could go in nltk.stem.porter2.PorterStemmer. |
I agree. It might actually be good to keep the older version around in case people need it for replicability.
On Dec 3, 2012, at 5:51 AM, Steven Bird notifications@github.com wrote:
|
related: #139 |
So, does anyone know if the rewritten stemmer is more correct than the existing one, considering the reference implementation (#139)? If so, we have an argument for adopting the rewritten stemmer in place of the existing one. |
Hey. It's unsurprising to me that the existing stemmer returns results different from a strict implementation of the Porter stemming algorithm (which, if I'm reading correctly, is the 'bug' being reported in that issue report?); there are multiple overt departures from the original algorithm in the current nltk implementation, which are all marked with big all-caps comments saying I don't know for certain off the top of my head that all the inconsistencies that the issue report highlights are linked to these departure points, but I strongly suspect that they are. Anyway, the rewritten stemmer I submitted is consistent with the existing nltk stemmer (meaning it still has the inconsistencies with a faithful Porter implementation that the existing nltk version has). Once the compatibility issue with old versions of Python are fixed (which is a one-line change), this means that what I've submitted is strictly and unambiguously an improvement over nltk's current Porter stemmer implementation (since the rewrite returns the same results as the current implementation, but faster and thread-safely). If you'd like, I'll change that set-literal to Python 2.6-compatible syntax tonight and push the change, and you can accept that. However, there are two other courses of action that one could argue would be even better:
But since option # 2 seems to have already been voted against on the grounds that we don't want to create inconsistencies with people's existing data created using the existing version, I would think that we would want to avoid option # 1 for the same reason. I don't really feel entitled to make decisions about what direction to take the module in, though, so I'm just putting out there what I see as the three options (a. use the rewrite, b. use Pystemmer/Snowball, or c. tweak the rewrite to be faithful to the original Porter algorithm) and it's up to you what you pick. With that in mind, I'll fix the 2.6-compatibility issue in the rewrite tonight and push the fix, then you guys can do with it as you see fit. :) Regards, |
This patch looks surprisingly similar to the modifications I made for the Gensim version. Did you use that as a reference? |
Just made the change to Python-2.6 syntax. Sorry that it took me so long to get around to pushing a one-liner - I am very lazy! @larsmans No, I wasn't aware of your changes or indeed of the existence of Gensim. I probably should've done a little research and looked around for other people's Porter implementations before diving into this; that's a lesson I'll remember next time I work on anything of this nature. Naturally, feel free to use anything I've done here in the Gensim version if that'd be helpful. |
Sorry for not responding earlier. Since this is just an efficiency improvement, not a change to the word->stem mappings, we should accept it. (I had misunderstood this at the time of my last post.) I'd be happy to drop backwards compatibility if we had a set of changes that made NLTK's version faithful with the original Porter Stemmer. |
Going through stem.doctest, I found an inconsistency. The existing version stemmed "traditional" to "tradit". However, with these changes, it is stemmed to "tradiat". I'll revert the changes until the situation is clarified. |
Sounds like I badly screwed up. I'll figure out what happened tonight and fix it. |
Hi,
I'm new to GitHub so sorry in advance if I'm doing anything wrong or improper here. Drop me a message if so and I'll fix it.
I've made some changes to the Porter Stemmer. The original version of it was ported from C and so there was lots of weirdness going on. Basically, as detailed in the commit message, my changes were:
I wrote up a consistency test and a performance test to confirm that this version spits out the same results and is faster, and ran this against the complete text of the King James Bible (804875 words). It passed the consistency test (i.e. each word stemmed to the same stem using either my improved stemmer or the old one), and on my laptop the old stemmer takes 15.9 seconds while the new stemmer takes 9.6 seconds.
Would be very cool if you'd pull my changes. :)
Regards,
Cabbage