NgramModel fix: 3rd time's the charm #937
Conversation
There are still some interfacing issues:
@alvations fixed, thanks!
Sorry for jumping on this so late, but are there plans to add alternatives to Katz backoff (interpolation methods)? I'd like to add a cached language model built on NgramModel, but it'd be nice if there were different techniques available for use. The distinction between a few of the ProbDist classes and language models is very fuzzy to me. For example, KneserNeyProbDist already takes in a frequency distribution of trigrams and computes lower-order language models for estimation. KneserNeyProbDist could easily be modified to start with any order of ngrams, instead of just trigrams. Witten-Bell can also be interpolated with lower order models.
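For concreteness, a minimal sketch of that point against the existing nltk.probability API (not this PR's NgramModel), assuming the Brown corpus is available: KneserNeyProbDist takes a FreqDist of trigram tuples and derives its lower-order estimates internally.

```python
from nltk import FreqDist, KneserNeyProbDist
from nltk.corpus import brown
from nltk.util import trigrams

# KneserNeyProbDist is hard-wired to trigrams: it expects a FreqDist whose
# samples are trigram tuples and builds the bigram/unigram statistics itself.
tri_fd = FreqDist(trigrams(brown.words(categories='news')))
kn = KneserNeyProbDist(tri_fd, discount=0.75)

print(kn.prob(('the', 'jury', 'said')))  # smoothed probability of one trigram
```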
Yea, when I added the KneserNey class there were a) issues with the NgramModel and b) what you pointed out about the distinction being unclear (even then).
Thanks @copper-head. If you're busy, I'd be happy to look into refactoring KN as well :)
Honestly, I think I'd rather you tackle interpolation backoff in NgramModel while I deal with KN. Basically this is because KN has its own version of backoff, so we might have to have a discussion about how that could interact with what the NgramModel does.
Gotcha, sure thing.
@copper-head, was browsing the … @jonsafari, @bryandeng, is there another way to learn the backoffs that doesn't run through the whole corpus for each order of ngram?
@alvations, you only need to run through the corpus once. You can build up counts of all n-gram orders at the same time. As you're doing that you should also increment global count-of-counts for use in determining the G-T expected counts (alpha). Depending on how you structure your n-gram counts, it can also be easy to determine your discount parameters (d) for Katz backoff. Conceptually it's easy to traverse a compressed trie (or some other tree-like structure) to determine one discount parameter at a time. Other ways are probably faster but less perspicuous (like traversing a hash table in one go and updating the various discount parameters piecemeal).
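A rough sketch of the single-pass counting described above, with hypothetical names (collect_counts is not an NLTK function, and a flat dict of Counters stands in for the trie): one traversal of the corpus accumulates n-gram counts for every order plus the per-order count-of-counts needed for the Good-Turing expected counts.

```python
from collections import Counter, defaultdict

def collect_counts(words, highest_order=3):
    """Single pass: n-gram counts for every order up to highest_order,
    plus count-of-counts per order (for G-T expected counts / Katz discounts)."""
    ngram_counts = defaultdict(Counter)          # order -> Counter of tuples
    for i in range(len(words)):
        for n in range(1, highest_order + 1):
            if i + n <= len(words):
                ngram_counts[n][tuple(words[i:i + n])] += 1
    # count-of-counts: how many distinct n-grams occur exactly r times
    count_of_counts = {n: Counter(c.values()) for n, c in ngram_counts.items()}
    return ngram_counts, count_of_counts

words = "the cat sat on the mat the cat sat".split()
counts, cofc = collect_counts(words)
print(counts[2][("the", "cat")])  # bigram count: 2
print(cofc[2][1])                 # number of bigrams seen exactly once: 4
```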
I've tried it out with:
@alvations, I can't replicate the above because of an earlier problem:
@stevenbird, we've sort of fixed that 0-samples iteration issue at PyCon by introducing …
@copper-head Sorry to revive an old thread, but you mentioned at the top that you got pickling to work? How? The ProbDists aren't pickleable as far as I can tell.
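As a quick illustration of where pickling usually breaks (a hedged guess, not an account of what this PR actually did): a plain ProbDist often round-trips on its own, but the anonymous estimator functions commonly passed around with these models cannot be pickled.

```python
import pickle
from nltk import FreqDist, LidstoneProbDist

fd = FreqDist("the cat sat on the mat".split())
pd = LidstoneProbDist(fd, 0.2)
restored = pickle.loads(pickle.dumps(pd))  # a standalone ProbDist round-trips

# ...whereas a lambda estimator is not picklable by the stdlib pickle module:
estimator = lambda fd, bins: LidstoneProbDist(fd, 0.2, bins)
# pickle.dumps(estimator)  # raises PicklingError
```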
Summary
This is a continuation of work started in #800. This time I'm focusing just on finalizing the fix for #367 as well as making a first pass at #396.
Relevant Audience
Input from @stevenbird @rmalouf @bcroy @dan-blanchard @afourney @alvations would be most appreciated!
Outstanding issues
SimpleGoodTuring still doesn't work; I haven't had time to test anything other than Laplace and Lidstone, really.
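For anyone who wants to poke at the SimpleGoodTuring problem separately from NgramModel, a small sketch comparing the estimators directly on a FreqDist (toy data chosen only for illustration; SimpleGoodTuring will warn that it is too sparse to fit well):

```python
from nltk import FreqDist
from nltk.probability import LaplaceProbDist, LidstoneProbDist, SimpleGoodTuringProbDist

fd = FreqDist("the cat sat on the mat the cat".split())

laplace = LaplaceProbDist(fd)
lidstone = LidstoneProbDist(fd, 0.2)
sgt = SimpleGoodTuringProbDist(fd)  # the estimator reported as still broken above

for word in ("the", "cat"):
    print(word, laplace.prob(word), lidstone.prob(word), sgt.prob(word))
```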