
Serious problem with SimpleGoodTuringProbDist #602

Closed
htInEdin opened this issue Feb 10, 2014 · 14 comments

@htInEdin
Given any fd with all singletons, SimpleGoodTuring becomes seriously confused, assigning all probability mass to unseens:

>>> fd = FreqDist(['a', 'b', 'c'])
>>> sgt = SimpleGoodTuringProbDist(fd)
>>> sgt.prob('c')
0.0
>>> sgt.prob('d')
1.0

Since this actually happens a lot with real n-gram data, it renders unusable not only SGT but also the default estimator for NgramModel.

@stevenbird
Member

Thanks for the bug report. I've asked for a volunteer from nltk-dev to look into this.
https://groups.google.com/forum/#!topic/nltk-dev/pvusfr7OFnY

@kmike
Member

kmike commented Feb 11, 2014

Both the Good-Turing and Simple Good-Turing probability estimation methods assign Nr(1)/N probability mass to unknown events, and when all events are singletons this mass is 1.0, leaving 0.0 for seen events. So I think this is a pathological case for the SGT probability estimation method, and you shouldn't use SGT for such data.

We could detect such cases in the _renormalize() method and issue a warning (raising an exception seems too harsh).

There is also #383, which may cause trouble even when there are events with frequency > 1; make sure you set the bins parameter manually.
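
To make the arithmetic concrete, here is a minimal pure-Python sketch (independent of NLTK; unseen_mass is a hypothetical helper name) of the Good-Turing unseen mass Nr(1)/N described above:

```python
from collections import Counter

def unseen_mass(samples):
    """Good-Turing estimate of the total probability mass reserved for
    unseen events: N1 / N, where N1 is the number of event types seen
    exactly once and N is the total number of observations."""
    freqs = Counter(samples)
    n1 = sum(1 for c in freqs.values() if c == 1)
    n = sum(freqs.values())
    return n1 / n

print(unseen_mass(['a', 'b', 'c']))        # all singletons -> 1.0
print(unseen_mass(['a', 'a', 'b', 'c']))   # 2 singletons / 4 tokens -> 0.5
```

With an all-singleton distribution the whole mass goes to unseen events, which is exactly the behaviour reported in this issue.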

@htInEdin
Author

Mikhail Korobov writes:

Both Good Turing and Simple Good Turing probability estimation
methods assign Nr(1)/N probability mass to unknown events, and
when all events are single this mass would be 1.0, leaving 0.0 for
seen events. So I think it is a pathological case for SGT
probability estimation method and you shouldn't use SGT for such
data.

That's just not a sufficient answer when NLTK ships with SGT as the
default estimator for NgramModel, and almost any NgramModel built from
nltk corpora will contain such cases.

We may detect such cases in _renormalize() method and issue a
warning (raising an exception seems too harsh).

Shouldn't the warning be accompanied by a useful return value?
Returning 0 will, inter alia, cause an exception if someone calls
logprob.

There is also #383 which may
cause troubles even if there are events with frequency > 1 - make
sure you set bins parameter manually.

Again, if this is true then this default (from ngram.py)

def _estimator(fdist, bins):
    """
    Default estimator function using a SimpleGoodTuringProbDist.
    """
    # can't be an instance method of NgramModel as they
    # can't be pickled either.
    return SimpleGoodTuringProbDist(fdist)

is broken and should be replaced.

I've failed to find relevant advice/workarounds yet; will keep
looking...

ht

   Henry S. Thompson, School of Informatics, University of Edinburgh
  10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
            Fax: (44) 131 650-4587, e-mail: ht@inf.ed.ac.uk
                   URL: http://www.ltg.ed.ac.uk/~ht/

[mail from me always has a .sig like this -- mail without it is forged spam]

@kmike
Member

kmike commented Feb 11, 2014

I think you are right and SimpleGoodTuring shouldn't be used for ConditionalProbDist (which NgramModel uses), especially as a default. There are many reported issues with NgramModel: GH-133, GH-157, GH-167, GH-367, GH-380, GH-388, GH-396, but unfortunately it seems nobody is actively working on fixing them right now.

As a workaround you may try using LidstoneProbDist. I'm not a user of NgramModel myself, but here is an example from the docstring that should work:

>>> from nltk.corpus import brown
>>> from nltk.probability import LidstoneProbDist
>>> est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
>>> lm = NgramModel(3, brown.words(categories='news'), estimator=est)

@stevenbird
Member

In view of all these unresolved issues with the model package, and the fact that fixes will probably impact the API, I would like to drop the package from the toolkit. It only contains about 100 lines of code, and we should probably start over. It is not mentioned in the NLTK book. We don't seem to have resources to deal with these bugs at present. We need to finalize the API for 3.0 beta. Any objections?

@htInEdin
Author

Steven Bird notifications@github.com writes:

In view of all these unresolved issues with the model package, and
the fact that fixes will probably impact the API, I would like to
drop the package from the toolkit. It only contains about 100 lines
of code, and we should probably start over. It is not mentioned in
the NLTK book. We don't seem to have resources to deal with these
bugs at present. We need to finalize the API for 3.0 beta. Any
objections?

I guess not. I've found a way to set it up to get usable results,
for letter bigrams at least, using a slightly hacked
WittenBellProbDist as the estimator, but it's a hack.
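
For context on why Witten-Bell copes better with all-singleton data: it reserves T/(N+T) for unseen events (T = number of observed types, N = token count), which never reaches 1.0. This is a hand-rolled illustration with a hypothetical helper name, not NLTK's WittenBellProbDist:

```python
from collections import Counter

def wb_unseen_mass(samples):
    # Witten-Bell reserves T / (N + T) for unseen events, where T is the
    # number of distinct observed types and N the number of tokens.
    freqs = Counter(samples)
    t, n = len(freqs), sum(freqs.values())
    return t / (n + t)

print(wb_unseen_mass(['a', 'b', 'c']))  # 3 / (3 + 3) = 0.5, not 1.0
```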

I've started a background task of building a way to compare the
behaviour of NgramModel+[some estimator] with an NgramModel with all
probabilities taken from an ARPA-format model file, which can be
computed offline using SRILM. I'm hoping to use that to debug both
at least some of the buggy ...ProbDist classes, and NgramModel itself.

But that will take a while.

ht


@stevenbird
Member

We can add language modelling functionality back in once 3.0 is finalised (and we can flag the API as experimental). I'd be interested in looking for resources to do this -- I think the functionality is important to have, both as a pure python reference implementation, and an interface to 3rd party software for large scale language modelling.

@iliakur
Contributor

iliakur commented Feb 16, 2014

Sorry to jump in on this out of the blue. I may be able to find someone interested in supervising me at school to work on fixing the Ngram model stuff. I've contributed a smoothing class to this package and would hate to see that axed along with everything else.

@stevenbird
Member

Thanks @copper-head, that would be very welcome. I'm going to take out the model package while we finalize NLTK 3.0, but I will leave it in a branch of the same name, and welcome pull requests anytime. Once the issues are dealt with, it can be reinstated. Feel free to discuss issues with the package on nltk-dev anytime. Thanks.

stevenbird added a commit that referenced this issue Feb 20, 2014
@alvations
Contributor

Here's a possible implementation for Good-Turing: http://phoenix.cs.clemson.edu/ldong/tutorial.php

@iliakur
Contributor

iliakur commented Nov 17, 2014

Hi all, I'm sorry for disappearing after saying I'd look into this. I finally have some time on my hands.
@stevenbird, let me know if this question is better suited for the nltk-dev mailing list, but should we maybe reconsider using SGT as the default estimator/smoother for the Ngram model? It makes more intuitive sense to me to have something much more basic as the default and then allow folks to load SGT if they so choose. We'd still have to fix SGT, but at least providing some minimal functionality may allow us to get the model package back online?

@alvations
Contributor

If there's a vote on this, I would like to +1 laplace and +0.001 lidstone smoothing as the default for Ngram models, and still keep SGT as an option when initializing the ngram model.

For scikit-learn, laplace/lidstone smoothing is the default estimator (e.g. http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
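
As a rough illustration of what a Lidstone default would do with the all-singleton distribution from this issue, here is a hand-rolled sketch (lidstone_prob is a hypothetical helper; NLTK's LidstoneProbDist handles the bins parameter more carefully, and the "+1 bin for unseen events" below is a simplifying assumption):

```python
from collections import Counter

def lidstone_prob(samples, event, gamma=0.2, bins=None):
    # Lidstone smoothing: (c(event) + gamma) / (N + gamma * B).
    # For simplicity, B defaults to the observed vocabulary size + 1,
    # leaving one extra bin for unseen events.
    freqs = Counter(samples)
    if bins is None:
        bins = len(freqs) + 1
    n = sum(freqs.values())
    return (freqs[event] + gamma) / (n + gamma * bins)

print(lidstone_prob(['a', 'b', 'c'], 'c'))  # seen event keeps most of its mass
print(lidstone_prob(['a', 'b', 'c'], 'd'))  # unseen event gets a small, nonzero share
```

Unlike the SGT failure above, no event ends up with probability 0.0 or 1.0, so logprob stays finite for seen events.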

@htInEdin
Author

Ilia Kurenkov writes:

@stevenbird, let me know if this question is better suited for the
nltk-dev mailing list, but should we maybe reconsider using SGT as
the default estimator/smoother for the Ngram model? It makes more
intuitive sense to me to have something much more basic as the
default and then allow folks to load SGT if they so choose. We'd
still have to fix SGT, but at least providing some minimal
functionality may allow us to get the model package back online?

+1 for using a simpler default estimator -- it's what I do with our
students.

ht


@stevenbird
Member

Would anyone following this thread like to take a look at the remaining issues with #800 please?
