NgramModel fix: 3rd time's the charm #937
Conversation
There are still some interfacing issues:
@alvations fixed, thanks!
Sorry for jumping on this so late, but are there plans to add alternatives to Katz backoff (interpolation methods)? I'd like to add a cached language model built on NgramModel, but it'd be nice if there were different techniques available for use. The distinction between a few of the ProbDist classes and language models is very fuzzy to me. For example, KneserNeyProbDist already takes in a frequency distribution of trigrams and computes lower-order language models for estimation. KneserNeyProbDist could easily be modified to start with any order of ngrams, instead of just trigrams. Witten-Bell can also be interpolated with lower order models.
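For concreteness, a minimal sketch of that point against the existing nltk.probability API (not this PR's NgramModel), assuming the Brown corpus is available: KneserNeyProbDist takes a FreqDist of trigram tuples and derives its lower-order estimates internally.

```python
from nltk import FreqDist, KneserNeyProbDist
from nltk.corpus import brown
from nltk.util import trigrams

# KneserNeyProbDist is hard-wired to trigrams: it expects a FreqDist whose
# samples are trigram tuples and builds the bigram/unigram statistics itself.
tri_fd = FreqDist(trigrams(brown.words(categories='news')))
kn = KneserNeyProbDist(tri_fd, discount=0.75)

print(kn.prob(('the', 'jury', 'said')))  # smoothed probability of one trigram
```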
Yea, when I added the KneserNey class there were a) issues with the NgramModel and b) what you pointed out about the distinction being unclear (even then).
Thanks @copper-head. If you're busy, I'd be happy to look into refactoring KN as well :)
Honestly, I think I'd rather you tackle interpolation backoff in NgramModel while I deal with KN. Basically this is because KN has its own version of backoff, so we might have to have a discussion about how that could interact with what the NgramModel does.
Gotcha, sure thing.
@copper-head, was browsing the … @jonsafari, @bryandeng, is there another way to learn the backoffs that doesn't run through the whole corpus for each order of ngram?
@alvations, you only need to run through the corpus once. You can build up counts of all n-gram orders at the same time. As you're doing that you should also increment global count-of-counts for use in determining the G-T expected counts (alpha). Depending on how you structure your n-gram counts, it can also be easy to determine your discount parameters (d) for Katz backoff. Conceptually it's easy to traverse a compressed trie (or some other tree-like structure) to determine one discount parameter at a time. Other ways are probably faster but less perspicuous (like traversing a hash table in one go and updating the various discount parameters piecemeal).
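A rough sketch of the single-pass counting described above, with hypothetical names (collect_counts is not an NLTK function, and a flat dict of Counters stands in for the trie): one traversal of the corpus accumulates n-gram counts for every order plus the per-order count-of-counts needed for the Good-Turing expected counts.

```python
from collections import Counter, defaultdict

def collect_counts(words, highest_order=3):
    """Single pass: n-gram counts for every order up to highest_order,
    plus count-of-counts per order (for G-T expected counts / Katz discounts)."""
    ngram_counts = defaultdict(Counter)          # order -> Counter of tuples
    for i in range(len(words)):
        for n in range(1, highest_order + 1):
            if i + n <= len(words):
                ngram_counts[n][tuple(words[i:i + n])] += 1
    # count-of-counts: how many distinct n-grams occur exactly r times
    count_of_counts = {n: Counter(c.values()) for n, c in ngram_counts.items()}
    return ngram_counts, count_of_counts

words = "the cat sat on the mat the cat sat".split()
counts, cofc = collect_counts(words)
print(counts[2][("the", "cat")])  # bigram count: 2
print(cofc[2][1])                 # number of bigrams seen exactly once: 4
```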
I've tried it out with:
@alvations, I can't replicate the above because of an earlier problem:
@stevenbird, we've sort of fixed that 0-samples iteration issue at PyCon by introducing …
@copper-head Sorry to revive an old thread, but you mentioned at the top that you got pickling to work? How? The ProbDists aren't pickleable as far as I can tell.
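As a quick illustration of where pickling usually breaks (a hedged guess, not an account of what this PR actually did): a plain ProbDist often round-trips on its own, but the anonymous estimator functions commonly passed around with these models cannot be pickled.

```python
import pickle
from nltk import FreqDist, LidstoneProbDist

fd = FreqDist("the cat sat on the mat".split())
pd = LidstoneProbDist(fd, 0.2)
restored = pickle.loads(pickle.dumps(pd))  # a standalone ProbDist round-trips

# ...whereas a lambda estimator is not picklable by the stdlib pickle module:
estimator = lambda fd, bins: LidstoneProbDist(fd, 0.2, bins)
# pickle.dumps(estimator)  # raises PicklingError
```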
Summary
This is a continuation of work started in #800. This time I'm focusing just on finalizing the fix for #367 as well as making a first pass at #396.
Relevant Audience
Input from @stevenbird @rmalouf @bcroy @dan-blanchard @afourney @alvations would be most appreciated!
Outstanding issues
SimpleGoodTuring still doesn't work; I haven't had time to test anything other than Laplace and Lidstone, really.
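For anyone who wants to poke at the SimpleGoodTuring problem separately from NgramModel, a small sketch comparing the estimators directly on a FreqDist (toy data chosen only for illustration; SimpleGoodTuring will warn that it is too sparse to fit well):

```python
from nltk import FreqDist
from nltk.probability import LaplaceProbDist, LidstoneProbDist, SimpleGoodTuringProbDist

fd = FreqDist("the cat sat on the mat the cat".split())

laplace = LaplaceProbDist(fd)
lidstone = LidstoneProbDist(fd, 0.2)
sgt = SimpleGoodTuringProbDist(fd)  # the estimator reported as still broken above

for word in ("the", "cat"):
    print(word, laplace.prob(word), lidstone.prob(word), sgt.prob(word))
```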