
Serious problem with SimpleGoodTuringProbDist #602

Closed
htInEdin opened this issue Feb 10, 2014 · 14 comments

@htInEdin
Given any fd with all singletons, SimpleGoodTuring becomes seriously confused, assigning all probability mass to unseens:

>>> fd = FreqDist(['a', 'b', 'c'])
>>> sgt = SimpleGoodTuringProbDist(fd)
>>> sgt.prob('c')
0.0
>>> sgt.prob('d')
1.0

Since this actually happens a lot with real n-gram data, it renders unusable not only SGT but also the default estimator for NgramModel.

@stevenbird
Member

Thanks for the bug report. I've asked for a volunteer from nltk-dev to look into this.
https://groups.google.com/forum/#!topic/nltk-dev/pvusfr7OFnY

@kmike
Member

kmike commented Feb 11, 2014

Both the Good-Turing and Simple Good-Turing probability estimation methods assign Nr(1)/N probability mass to unknown events, and when all events are singletons this mass is 1.0, leaving 0.0 for seen events. So I think this is a pathological case for the SGT probability estimation method, and you shouldn't use SGT for such data.

We could detect such cases in the _renormalize() method and issue a warning (raising an exception seems too harsh).

There is also #383, which may cause trouble even when there are events with frequency > 1; make sure you set the bins parameter manually.
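
To make the arithmetic concrete, here is a minimal pure-Python sketch (independent of NLTK; unseen_mass is a hypothetical helper name) of the Good-Turing unseen mass Nr(1)/N described above:

```python
from collections import Counter

def unseen_mass(samples):
    """Good-Turing estimate of the total probability mass reserved for
    unseen events: N1 / N, where N1 is the number of event types seen
    exactly once and N is the total number of observations."""
    freqs = Counter(samples)
    n1 = sum(1 for c in freqs.values() if c == 1)
    n = sum(freqs.values())
    return n1 / n

print(unseen_mass(['a', 'b', 'c']))        # all singletons -> 1.0
print(unseen_mass(['a', 'a', 'b', 'c']))   # 2 singletons / 4 tokens -> 0.5
```

With an all-singleton distribution the whole mass goes to unseen events, which is exactly the behaviour reported in this issue.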

@htInEdin
Author

Mikhail Korobov writes:

Both Good Turing and Simple Good Turing probability estimation
methods assign Nr(1)/N probability mass to unknown events, and
when all events are single this mass would be 1.0, leaving 0.0 for
seen events. So I think it is a pathological case for SGT
probability estimation method and you shouldn't use SGT for such
data.

That's just not a sufficient answer when NLTK ships with SGT as the
default estimator for NgramModel, and almost any NgramModel built from
nltk corpora will contain such cases.

We may detect such cases in _renormalize() method and issue a
warning (raising an exception seems too harsh).

Shouldn't the warning be accompanied by a useful return value?
Returning 0 will, inter alia, cause an exception if someone calls
logprob.

There is also #383 which may
cause troubles even if there are events with frequency > 1 - make
sure you set bins parameter manually.

Again, if this is true then this default (from ngram.py)

def _estimator(fdist, bins):
    """
    Default estimator function using a SimpleGoodTuringProbDist.
    """
    # can't be an instance method of NgramModel as they
    # can't be pickled either.
    return SimpleGoodTuringProbDist(fdist)

is broken and should be replaced.

I've failed to find relevant advice/workarounds yet; will keep
looking...

ht

   Henry S. Thompson, School of Informatics, University of Edinburgh
  10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/

[mail from me always has a .sig like this -- mail without it is forged spam]

@kmike
Member

kmike commented Feb 11, 2014

I think you are right and SimpleGoodTuring shouldn't be used for ConditionalProbDist (which NgramModel uses), especially as a default. There are many reported issues with NgramModel: GH-133, GH-157, GH-167, GH-367, GH-380, GH-388, GH-396, but unfortunately it seems nobody is actively working on fixing them right now.

As a workaround you may try using LidstoneProbDist. I'm not a user of NgramModel myself, but here is an example from the docstring that should work:

>>> from nltk.corpus import brown
>>> from nltk.probability import LidstoneProbDist
>>> est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
>>> lm = NgramModel(3, brown.words(categories='news'), estimator=est)

@stevenbird
Member

In view of all these unresolved issues with the model package, and the fact that fixes will probably impact the API, I would like to drop the package from the toolkit. It only contains about 100 lines of code, and we should probably start over. It is not mentioned in the NLTK book. We don't seem to have resources to deal with these bugs at present. We need to finalize the API for 3.0 beta. Any objections?

@htInEdin
Author

Steven Bird notifications@github.com writes:

In view of all these unresolved issues with the model package, and
the fact that fixes will probably impact the API, I would like to
drop the package from the toolkit. It only contains about 100 lines
of code, and we should probably start over. It is not mentioned in
the NLTK book. We don't seem to have resources to deal with these
bugs at present. We need to finalize the API for 3.0 beta. Any
objections?

I guess not. I've found a way to set it up to get usable results,
for letter bigrams at least, using a slightly hacked
WittenBellProbDist as the estimator, but it's a hack.
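
For context on why Witten-Bell copes better with all-singleton data: it reserves T/(N+T) for unseen events (T = number of observed types, N = token count), which never reaches 1.0. This is a hand-rolled illustration with a hypothetical helper name, not NLTK's WittenBellProbDist:

```python
from collections import Counter

def wb_unseen_mass(samples):
    # Witten-Bell reserves T / (N + T) for unseen events, where T is the
    # number of distinct observed types and N the number of tokens.
    freqs = Counter(samples)
    t, n = len(freqs), sum(freqs.values())
    return t / (n + t)

print(wb_unseen_mass(['a', 'b', 'c']))  # 3 / (3 + 3) = 0.5, not 1.0
```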

I've started a background task of building a way to compare the
behaviour of NgramModel+[some estimator] with an NgramModel with all
probabilities taken from an ARPA-format model file, which can be
computed offline using SRILM. I'm hoping to use that to debug both
at least some of the buggy ...ProbDist classes, and NgramModel itself.

But that will take a while.

ht


@stevenbird
Member

We can add language modelling functionality back in once 3.0 is finalised (and we can flag the API as experimental). I'd be interested in looking for resources to do this -- I think the functionality is important to have, both as a pure python reference implementation, and an interface to 3rd party software for large scale language modelling.

@iliakur
Contributor

iliakur commented Feb 16, 2014

Sorry to jump in on this out of the blue. I may be able to find someone interested in supervising me at school to work on fixing the Ngram model stuff. I've contributed a smoothing class to this package and would hate to see that axed along with everything else.

@stevenbird
Member

Thanks @copper-head, that would be very welcome. I'm going to take out the model package while we finalize NLTK 3.0, but I will leave it in a branch of the same name, and welcome pull requests anytime. Once the issues are dealt with, it can be reinstated. Feel free to discuss issues with the package on nltk-dev anytime. Thanks.

stevenbird added a commit that referenced this issue Feb 20, 2014
@alvations
Contributor

Here's a possible implementation for Good-Turing: http://phoenix.cs.clemson.edu/ldong/tutorial.php

@iliakur
Contributor

iliakur commented Nov 17, 2014

Hi all, I'm sorry for disappearing after saying I'd look into this. I finally have some time on my hands.
@stevenbird, let me know if this question is better suited for the nltk-dev mailing list, but should we maybe reconsider using SGT as the default estimator/smoother for the Ngram model? It makes more intuitive sense to me to have something much more basic as the default and then allow folks to load SGT if they so choose. We'd still have to fix SGT, but at least providing some minimal functionality may allow us to get the model package back online?

@alvations
Contributor

If there's a vote on this, I would like to +1 laplace and +0.001 lidstone smoothing as the default for Ngram models, and still keep SGT as an option when initializing the ngram model.

For scikit-learn, laplace/lidstone smoothing is the default estimator (e.g. http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
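
As a rough illustration of what a Lidstone default would do with the all-singleton distribution from this issue, here is a hand-rolled sketch (lidstone_prob is a hypothetical helper; NLTK's LidstoneProbDist handles the bins parameter more carefully, and the "+1 bin for unseen events" below is a simplifying assumption):

```python
from collections import Counter

def lidstone_prob(samples, event, gamma=0.2, bins=None):
    # Lidstone smoothing: (c(event) + gamma) / (N + gamma * B).
    # For simplicity, B defaults to the observed vocabulary size + 1,
    # leaving one extra bin for unseen events.
    freqs = Counter(samples)
    if bins is None:
        bins = len(freqs) + 1
    n = sum(freqs.values())
    return (freqs[event] + gamma) / (n + gamma * bins)

print(lidstone_prob(['a', 'b', 'c'], 'c'))  # seen event keeps most of its mass
print(lidstone_prob(['a', 'b', 'c'], 'd'))  # unseen event gets a small, nonzero share
```

Unlike the SGT failure above, no event ends up with probability 0.0 or 1.0, so logprob stays finite for seen events.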

@htInEdin
Author

Ilia Kurenkov writes:

@stevenbird, let me know if this question is better suited for the
nltk-dev mailing list, but should we maybe reconsider using SGT as
the default estimator/smoother for the Ngram model? It makes more
intuitive sense to me to have something much more basic as the
default and then allow folks to load SGT if they so choose. We'd
still have to fix SGT, but at least providing some minimal
functionality may allow us to get the model package back online?

+1 for using a simpler default estimator -- it's what I do with our
students.

ht


@stevenbird
Member

Would anyone following this thread like to take a look at the remaining issues with #800 please?
