NgramModel fix: 3rd time's the charm #937

Merged 10 commits into nltk:model on Apr 19, 2015

Conversation

iliakur (Contributor) commented Apr 4, 2015

Summary

This is a continuation of work started in #800. This time I'm focusing just on finalizing the fix for #367 as well as making a first pass at #396.

Relevant Audience
Input from @stevenbird @rmalouf @bcroy @dan-blanchard @afourney @alvations would be most appreciated!

Outstanding issues

  • SimpleGoodTuring still doesn't work; I haven't had time to test anything other than Laplace and Lidstone, really.
  • Pickling works, but NgramModel lacks reasonable comparison methods, so doctests for pickling will have to wait until that's fixed (see the hypothetical sketch below).
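
For context on the second point: a pickling doctest needs models to compare equal after a save/load round trip. Purely as an illustration (this is not code from the PR, and the attribute names _n and _ngrams are assumptions about the model's internals), the missing piece would look something like:

# Hypothetical sketch only -- not part of this PR.
# A pickling doctest needs NgramModel instances to compare equal
# after a round trip through pickle.dumps/pickle.loads.
def __eq__(self, other):
    # _n (the model order) and _ngrams (the observed ngrams) are
    # illustrative attribute names, not necessarily the real ones.
    return (isinstance(other, NgramModel)
            and self._n == other._n
            and self._ngrams == other._ngrams)

def __ne__(self, other):
    return not self == other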

alvations (Contributor) commented

There are still some interfacing issues:

from nltk.model import NgramModel
from nltk.corpus import brown

lm = NgramModel(3, brown.words(categories='news'))

[out]:

Traceback (most recent call last):
  File "/home/alvas/git/nltk/test_model.py", line 4, in <module>
    lm = NgramModel(3, brown.words(categories='news'))
  File "/home/alvas/git/nltk/nltk/model/ngram.py", line 113, in __init__
    *estimator_args,
NameError: global name 'estimator_args' is not defined

iliakur (Contributor, Author) commented Apr 4, 2015

@alvations fixed, thanks!

smilli (Contributor) commented Apr 6, 2015

Sorry for jumping on this so late, but are there plans to add alternatives to Katz backoff (interpolation methods)? I'd like to add a cached language model built on NgramModel, but it'd be nice if there were different techniques available for use.

The distinction between a few of the ProbDist classes and language models is very fuzzy to me. For example, KneserNeyProbDist already takes in a frequency distribution of trigrams and computes lower-order language models for estimation. KneserNeyProbDist could easily be modified to start with any order ngrams, instead of just trigrams. Witten-Bell can also be interpolated with lower order models.
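
For reference, here is a minimal sketch of the usage described above, assuming the KneserNeyProbDist API in nltk.probability at the time (it accepts only a FreqDist built over trigrams):

# Sketch: KneserNeyProbDist is hard-coded to trigram FreqDists.
from nltk import FreqDist
from nltk.corpus import brown
from nltk.probability import KneserNeyProbDist
from nltk.util import ngrams

trigram_counts = FreqDist(ngrams(brown.words(categories='news'), 3))
kn = KneserNeyProbDist(trigram_counts)
print(kn.prob(('of', 'the', 'year')))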

iliakur (Contributor, Author) commented Apr 8, 2015

Yeah, when I added the KneserNey class there were (a) issues with the NgramModel and (b) what you pointed out about the distinction being unclear (even then).
I'm at a conference right now, but once I'm back I'd want to refactor KN to work with any ngram model.
Adding interpolation as an alternative to Katz would also be cool!

smilli (Contributor) commented Apr 8, 2015

Thanks @copper-head. If you're busy, I'd be happy to look into refactoring KN as well :)

iliakur (Contributor, Author) commented Apr 9, 2015

Honestly, I think I'd rather you tackle interpolation backoff in NgramModel while I deal with KN. Basically this is because KN has its own version of backoff, so we might need to discuss how that could interact with what NgramModel does.

smilli (Contributor) commented Apr 9, 2015

Gotcha, sure thing.

alvations (Contributor) commented

@copper-head, I was browsing ngram.py more thoroughly and this looks to be a very expensive line: https://github.com/Copper-Head/nltk/blob/model/nltk/model/ngram.py#L112

@jonsafari, @bryandeng, is there another way to learn the backoffs that doesn't run through the whole corpus for each order of ngram?

jonsafari commented

@alvations , you only need to run through the corpus once. You can build up counts of all n-gram orders at the same time. As you're doing that you should also increment global count-of-counts for use in determining the G-T expected counts (alpha). Depending on how you structure your n-gram counts, it can also be easy to determine your discount parameters (d) for Katz backoff. Conceptually it's easy to traverse a compressed trie (or some other tree-like structure) to determine one discount parameter at a time. Other ways are probably faster but less perspicuous (like traversing a hash table in one go and updating the various discount parameters piecemeal).
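
To make the one-pass idea concrete, here is a minimal sketch (not NLTK's actual implementation) that accumulates counts for every n-gram order, plus the global count-of-counts needed for the Good-Turing expected counts, in a single traversal of the corpus:

from collections import Counter

def count_all_orders(tokens, highest_order):
    # One pass over the corpus: at each position, update the counts
    # for every n-gram order that ends at that position.
    counts = {n: Counter() for n in range(1, highest_order + 1)}
    tokens = list(tokens)
    for i in range(len(tokens)):
        for n in range(1, highest_order + 1):
            if i + 1 >= n:
                counts[n][tuple(tokens[i + 1 - n:i + 1])] += 1
    # Global count-of-counts per order, for Good-Turing estimation.
    counts_of_counts = {n: Counter(c.values())
                        for n, c in counts.items()}
    return counts, counts_of_counts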

alvations (Contributor) commented

I tried brown.sents() instead of brown.words() after finding the bug/feature in the docstring. Now there's some strange behavior:

from nltk.corpus import brown
from nltk.model import NgramModel


lm = NgramModel(3, brown.sents(categories='news'))
print(lm)

for i in lm:
    print(i)

[out]:

<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>
<LidstoneProbDist based on 0 samples>

stevenbird merged commit c54edec into nltk:model on Apr 19, 2015
stevenbird (Member) commented

@alvations, I can't replicate the above, because of an earlier problem:

>>> lm = NgramModel(3, brown.sents(categories='news'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sb/git/nltk/nltk/model/ngram.py", line 107, in __init__
    cfd[context][token] += 1
TypeError: 'int' object is not subscriptable

alvations (Contributor) commented

@stevenbird, we sort of fixed those 0-sample iterations at PyCon by introducing __iter__ in the NgramModel class. I think @copper-head will commit that and some small fixes soon.

ELind77 commented Feb 29, 2016

@copper-head Sorry to revive an old thread, but you mentioned at the top that you got pickling to work? How? The ProbDists aren't pickleable as far as I can tell.

iliakur (Contributor, Author) commented Mar 30, 2016

@ELind77 We're actually reworking the model to the point where the pickling discussion is more or less moot.
If you have time though, feel free to check out #1342 and #1351 for recent developments. Feedback is most welcome :)
