Add bigrams #1
@ronojoy just implemented the bigram model; that paper looks interesting, thanks! :) Let me know your views on the bigram implementation.
Yes, this looks good. Things will get cumbersome as you increase the Markov chain order with this approach. Therefore, can you now try to use the NLTK n-gram class to write this as a general n-gram model, with n = 1, 2, 3, ... given as a parameter? This code should not take more than 10 lines. Also, check out the options for smoothing the n-gram model in the NLTK class. How about trying this for Indian languages, using Unicode? A quick pointer to the theory: https://sites.google.com/site/gothnlp/links/probability-and-n-grams
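For reference, here is a minimal pure-Python sketch of what a general n-gram model boils down to; the function names and the sampling helper are illustrative, not NLTK's actual API:

```python
import random
from collections import defaultdict

def train_ngram(tokens, n):
    """Map each (n-1)-word history to counts of the word that follows it."""
    model = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        history, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        model[history][word] += 1
    return model

def generate(model, n, seed, length, rng=random):
    """Extend `seed` (n-1 words) by `length` words, sampling each next word
    in proportion to its count after the current history."""
    out = list(seed)
    for _ in range(length):
        counts = model[tuple(out[len(out) - (n - 1):])]
        if not counts:
            break
        words, weights = zip(*counts.items())
        out.append(rng.choices(words, weights=weights)[0])
    return out
```

Changing `n` switches between unigram, bigram, trigram, etc., without touching the code, which is the point of writing it generically.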
Thanks @ronojoy, I used the NLTK library and the implementation is now very simple: just 2-3 lines. Btw, as you suggested, I tried training on Tamil text and the results are good. I used the Thirukkural and one of Bharathiyar's stories/poems. I noticed that words rarely repeat in those texts, so I'm getting sentences almost exactly as they appear in the training texts, just in a different order. I also played around with other corpora available in NLTK. I'm wondering what their real-world applications would be; maybe transliteration / translation / speech recognition engines might use them to choose the most probable next word? One more thing I noticed: when I ran the Thirukkural corpus through my own bigram code and tried to generate text, it took a very long time (> 20 minutes) before I stopped the script, but the same thing runs in seconds using the NLTK library. There's a huge performance difference.
@yeskarthik, just had a look at your code, and here are some more suggestions to improve the model:
N-gram models have tons of applications in NLP. They are usually the first port of call for simple classification tasks. For instance, a naive Bayes classifier invokes an n-gram model with N = 1 to compute word probabilities. Suggestions for next words (e.g. on a search engine) are also generated by N-gram models. Likewise for the T9 algorithm, etc. I haven't looked into the NLTK N-gram method implementation. An optimised data structure or off-loading to C code could be possible reasons why they get so much speedup.
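As a toy illustration of the N = 1 case mentioned above, a naive Bayes classifier over unigram counts can be sketched like this (the names and the add-one smoothing choice are illustrative, not a specific library's API):

```python
import math
from collections import Counter

def train_nb(docs_by_label):
    """Collect per-label unigram counts and the shared vocabulary."""
    counts = {label: Counter(w for doc in docs for w in doc.split())
              for label, docs in docs_by_label.items()}
    vocab = {w for c in counts.values() for w in c}
    return counts, vocab

def classify(counts, vocab, text):
    """Pick the label maximising the sum of add-one-smoothed log probabilities."""
    def score(label):
        c, total = counts[label], sum(counts[label].values())
        return sum(math.log((c[w] + 1) / (total + len(vocab)))
                   for w in text.split())
    return max(counts, key=score)
```

The classifier is literally just a unigram model per class plus Bayes' rule, which is why naive Bayes counts as an n-gram application.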
Thanks @ronojoy, I was reading about the smoothing that I found here
While implementing the Witten-Bell smoother, I also found that NLTK has removed 'models' (including NgramModel) from its latest version (develop), since it has a number of unresolved bugs, including the one I faced while implementing it (I got a division-by-zero error). Do you have any other library in mind? Refs:
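For what it's worth, the Witten-Bell idea can be sketched without NLTK; explicitly guarding the zero-count cases avoids division-by-zero errors like the one above. This is a simplified, hypothetical version for bigrams over a fixed vocabulary, not NLTK's implementation:

```python
from collections import Counter, defaultdict

def witten_bell(bigrams, vocab):
    """Return a function P(word | history) with Witten-Bell smoothing."""
    follows = defaultdict(Counter)
    for h, w in bigrams:
        follows[h][w] += 1

    def prob(h, w):
        c = follows[h]
        n, t = sum(c.values()), len(c)   # tokens and distinct types after h
        if n == 0:                       # unseen history: fall back to uniform
            return 1 / len(vocab)
        z = len(vocab) - t               # types never seen after h
        if w in c:
            return c[w] / (n + t)        # discounted seen-event probability
        return t / (z * (n + t)) if z else 0.0   # mass spread over unseen types
    return prob
```

The key property is that, for any history, the probabilities sum to 1 over the vocabulary, with the reserved mass `t / (n + t)` shared among unseen continuations.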
@yeskarthik, they have removed the n-gram model from the main branch since there are bugs in parts of the code. Faizal (#valuefromdata) and I are planning to work on this over the next several weeks to fix the bugs and send in a pull request to NLTK. You are welcome to help out, if you want. The current solution is to roll back to an older version of NLTK, avoid the Lidstone family of smoothers, and generally be careful to check that all returned probabilities are between 0 and 1. You can also talk to Faizal for additional input. The other library that has everything implemented, and is generally bug-free, is the SRI Language Modelling toolkit. It is an old-style C library, with a command-line binary and millions of switches! :) Give it a spin, if only to see how much cooler working in Python is.
@yeskarthik, nice work. Can you improve the generator by including bigrams and sampling from the bigram probability distribution? Have a look at this paper for details on how to do this.
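Sampling the next word in proportion to bigram counts can be sketched like this (the toy corpus and helper name are illustrative):

```python
import random
from collections import Counter

def next_word(bigram_counts, word, rng=random):
    """Draw the next word with probability proportional to its bigram count."""
    counts = bigram_counts[word]
    words, weights = zip(*counts.items())
    return rng.choices(words, weights=weights)[0]

# The counts can come from any tokenised corpus, e.g.:
tokens = "the cat sat on the mat".split()
bigram_counts = {}
for a, b in zip(tokens, tokens[1:]):
    bigram_counts.setdefault(a, Counter())[b] += 1
```

Sampling (rather than always taking the most frequent successor) is what keeps the generated text from looping on the single most common bigram.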