Serious problem with SimpleGoodTuringProbDist #602
Thanks for the bug report. I've asked for a volunteer from nltk-dev to look into this.
Both Good Turing and Simple Good Turing probability estimation methods assign […]. We may detect such cases in […]. There is also #383, which may cause trouble even if there are events with frequency > 1; make sure you set […].
Mikhail Korobov writes:
That's just not a sufficient answer when NLTK ships with SGT as the default for NgramModel.
Shouldn't the warning be accompanied by a useful return value?
Again, if this is true, then this default (from ngram.py), `def _estimator(fdist, bins)`, is broken and should be replaced. I've failed to find relevant advice/workarounds yet; will keep looking.
ht
[mail from me always has a .sig like this -- mail without it is forged spam]
I think you are right and SimpleGoodTuring shouldn't be used for ConditionalProbDist (which NgramModel uses), especially as a default. There are many reported issues with NgramModel: GH-133, GH-157, GH-167, GH-367, GH-380, GH-388, GH-396, but unfortunately it seems nobody is actively working on fixing them right now. As a workaround you may try using LidstoneProbDist. I'm not a user of NgramModel myself, but the NgramModel docstring has an example that should work.
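For reference, here is a minimal plain-Python sketch of what that Lidstone workaround computes. The `LidstoneProbDist(fdist, 0.2)` estimator usage and the gamma value of 0.2 are assumptions based on NLTK 2.x conventions, not quoted from the docstring:

```python
from collections import Counter

def lidstone_prob(counts, total, bins, word, gamma=0.2):
    """Lidstone-smoothed probability: (c + gamma) / (N + gamma * B).

    gamma=1 gives Laplace (add-one) smoothing. In NLTK 2.x one would
    pass something like ``lambda fdist, bins: LidstoneProbDist(fdist, 0.2)``
    as the NgramModel estimator (assumed, not quoted).
    """
    return (counts[word] + gamma) / (total + gamma * bins)

words = "the cat sat on the mat".split()
counts = Counter(words)
N = len(words)           # 6 tokens
B = len(counts) + 1      # observed types, plus one bin for unseen words

p_the = lidstone_prob(counts, N, B, "the")      # seen twice
p_unseen = lidstone_prob(counts, N, B, "dog")   # never seen

# Unlike SGT on all-singleton data, every seen word keeps nonzero
# probability and unseen words get only a small share of the mass.
assert p_the > p_unseen > 0
```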
In view of all these unresolved issues with the […]
Steven Bird notifications@github.com writes:
I guess not. I've found a way to set it up to get usable results. I've started a background task of building a way to compare the […]. But that will take a while.
ht
We can add language modelling functionality back in once 3.0 is finalised (and we can flag the API as experimental). I'd be interested in looking for resources to do this -- I think the functionality is important to have, both as a pure python reference implementation, and an interface to 3rd party software for large scale language modelling.
Sorry to jump in on this out of the blue. I may be able to find someone interested in supervising me at school to work on fixing the Ngram model stuff. I've contributed a smoothing class to this package and would hate to see that axed along with everything else.
Thanks @copper-head, that would be very welcome. I'm going to take out the […]
Here's a possible implementation for Good-Turing: http://phoenix.cs.clemson.edu/ldong/tutorial.php
Hi all, I'm sorry for disappearing after saying I'd look into this. I finally have some time on my hands.
If there's a vote on this, I would like +1 Laplace and +0.001 Lidstone smoothing as the default for Ngram models, and still have SGT as a possible option when initializing the ngram model. For […]
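To illustrate the trade-off in that vote, here is a small self-contained sketch (plain Python, not NLTK code) of how the additive-smoothing parameter gamma splits probability mass between seen and unseen events on exactly the all-singleton data that breaks SGT:

```python
from collections import Counter

def smoothed_prob(counts, total, bins, word, gamma):
    """Additive smoothing: gamma=1.0 is Laplace, small gamma is Lidstone."""
    return (counts[word] + gamma) / (total + gamma * bins)

words = "a b c d e f".split()         # all singletons: the case that breaks SGT
counts = Counter(words)
N, B = len(words), len(counts) + 1    # one extra bin for unseen words

for gamma, name in [(1.0, "Laplace (+1)"), (0.001, "Lidstone (+0.001)")]:
    p_seen = smoothed_prob(counts, N, B, "a", gamma)
    p_unseen = smoothed_prob(counts, N, B, "z", gamma)
    # Smaller gamma trusts the observed counts more: seen words keep
    # nearly their relative-frequency mass, unseen words get much less.
    print(f"{name}: seen={p_seen:.4f} unseen={p_unseen:.6f}")
```

Either choice avoids the SGT pathology: seen events never collapse to zero probability.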
Ilia Kurenkov writes:
+1 for using a simpler default estimator -- it's what I do with our […].
ht
Would anyone following this thread like to take a look at the remaining issues with #800, please?
Given any fd with all singletons, SimpleGoodTuring becomes seriously confused, assigning all probability mass to unseens.
Since this actually happens a lot with real ngram data, it renders not only SGT unusable, but also the default for NgramModel.
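A sketch of why all-singleton data breaks Good-Turing estimation: the adjusted count r* = (r + 1) * N_{r+1} / N_r is zero for r = 1 whenever nothing occurs twice (N_2 = 0), so every seen event gets probability zero and the unseen mass N_1 / N is the entire distribution. The code below implements the plain textbook formula, not NLTK's SimpleGoodTuringProbDist (which fits a smoothed curve to the N_r counts but, per this report, still misbehaves on such data):

```python
from collections import Counter

def good_turing_seen_prob(samples, word):
    """Plain (unsmoothed) Good-Turing estimate for a seen event.

    Adjusted count r* = (r + 1) * N_{r+1} / N_r; probability is r* / N.
    Textbook formula only -- not NLTK's SimpleGoodTuringProbDist.
    """
    counts = Counter(samples)
    freq_of_freqs = Counter(counts.values())  # N_r: number of types seen r times
    n = len(samples)
    r = counts[word]
    r_star = (r + 1) * freq_of_freqs.get(r + 1, 0) / freq_of_freqs[r]
    return r_star / n

# All singletons: N_2 = 0, so r* = 0 for every seen word ...
assert good_turing_seen_prob(list("abcdef"), "a") == 0.0

# ... and the reserved unseen mass N_1 / N is the whole distribution.
samples = list("abcdef")
p0 = Counter(Counter(samples).values())[1] / len(samples)
assert p0 == 1.0
```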