
Add bigrams #1

Open
ronojoy opened this issue May 1, 2015 · 6 comments
ronojoy commented May 1, 2015

@yeskarthik, nice work. Can you improve the generator by including bigrams and sampling from the bigram probability distribution? Have a look at this paper for details on how to do this.

yeskarthik (Owner) commented:

@ronojoy, just implemented the bigram version; that paper looks interesting, thanks :) Let me know your views on the bigram implementation.

yeskarthik reopened this May 1, 2015

ronojoy commented May 1, 2015

Yes, this looks good. Things will get cumbersome as you increase the Markov chain order with this approach. Therefore, can you now try to use the NLTK n-gram class to write this for a general n-gram model, with n = 1, 2, 3, ... given as a parameter? This code should not take more than 10 lines (a rough sketch follows the pointers below). Also, check out the options for smoothing the n-gram model in the NLTK class. How about trying to do this for Indian languages, using Unicode? Two pointers to help you:

quick theory: https://sites.google.com/site/gothnlp/links/probability-and-n-grams
NLTK ngrams: http://www.nltk.org/api/nltk.model.html
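
A minimal sketch of what such a parameterised generator could look like using NLTK's `ngrams()` and `ConditionalFreqDist`; this is not the repository's code, and the Gutenberg corpus, the function name, and the sampling details are placeholders:

```python
import random
from nltk import ngrams
from nltk.corpus import gutenberg  # placeholder corpus; any token list works
from nltk.probability import ConditionalFreqDist

def generate(tokens, n=2, length=20):
    """Generate `length` words from an order-n model (n >= 2) by sampling
    the next word from the counts conditioned on the previous n-1 words."""
    cfd = ConditionalFreqDist(
        (tuple(gram[:-1]), gram[-1]) for gram in ngrams(tokens, n)
    )
    out = list(random.choice(list(cfd.conditions())))  # random starting history
    for _ in range(length):
        dist = cfd[tuple(out[-(n - 1):])]
        if not dist:  # dead end: this history was never seen with a continuation
            break
        words, counts = zip(*dist.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate(gutenberg.words("austen-emma.txt"), n=3))
```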

yeskarthik (Owner) commented:

Thanks @ronojoy, I used the NLTK library and the implementation is now very simple, just 2-3 lines. Btw, as you suggested, I tried using Tamil text for training and the results are good. I used the Thirukural and one of Bharathiyar's stories/poems. I noticed that words rarely repeat in those texts, so I am getting sentences almost exactly as they appear in the training texts, just in a different order.

I also played around with other corpora available in the NLTK library. I'm wondering what their real-world applications would be. Maybe transliteration / translation / voice recognition engines might use them to choose the most probable next word?

One more thing I noticed: when I ran my bigram code on the Thirukural corpus and tried to generate text, it took a very long time (> 20 minutes or so) before I stopped the script, but the same thing runs in seconds using the NLTK library; there's a huge performance difference.


ronojoy commented May 19, 2015

@yeskarthik, just had a look at your code and here are some more suggestions to improve the model:

  • increase the n-gram order to 3 and, at a push, try 4
  • definitely add smoothing when you construct the n-gram model. Try the Witten-Bell smoother (a rough sketch follows this list). Kneser-Ney is supposed to be the best but is not implemented in NLTK. (Want to implement it yourself?)
  • for fun, you might want to compute the entropy of the model and compare it against English poetry.
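
A rough sketch of Witten-Bell smoothing using the estimator classes in `nltk.probability`; the trigram order, the toy corpus, and the choice of `bins` are assumptions for illustration, not the repository's code:

```python
from nltk import ngrams
from nltk.probability import (ConditionalFreqDist, ConditionalProbDist,
                              WittenBellProbDist)

tokens = "the quick brown fox jumps over the lazy dog".split()  # toy corpus
n = 3

# Count trigram continuations conditioned on the two-word history.
cfd = ConditionalFreqDist(
    (tuple(gram[:-1]), gram[-1]) for gram in ngrams(tokens, n)
)

# Witten-Bell reserves probability mass for unseen words; `bins` is the number
# of possible outcomes per context (here the vocabulary size, as an assumption).
vocab = len(set(tokens))
cpd = ConditionalProbDist(cfd, WittenBellProbDist, bins=vocab)

print(cpd[("the", "quick")].prob("brown"))  # smoothed P(brown | the quick)
print(cpd[("the", "quick")].prob("dog"))    # small but non-zero for an unseen word
```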

N-gram models have tons of applications in NLP. They are usually the first port of call for simple classification tasks. For instance, a naive Bayes classifier invokes an n-gram model with N = 1 to compute word probabilities. Suggestions for next words (e.g. on a search engine) are also generated by N-gram models. Likewise for the T9 algorithm, etc.

I haven't looked into the NLTK N-gram implementation. An optimised data structure or off-loading to C code could be possible reasons why they get so much of a speedup.

yeskarthik (Owner) commented:

Thanks @ronojoy, I was reading about smoothing from these references:

  1. http://www.cs.jhu.edu/~jason/465/PDFSlides/lect05-smoothing.pdf
  2. http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

While trying to implement the Witten-Bell smoother, I found that NLTK has removed the 'model' module (including NgramModel) from its latest version (develop branch), since it has a number of unresolved bugs, including the one I ran into while implementing it (a division-by-zero error).

Do you have any other library in mind?

Refs:

  1. http://stackoverflow.com/questions/15697623/training-and-evaluating-bigram-trigram-distributions-with-ngrammodel-in-nltk-us (This is the error that I got)
  2. Error in NgramModel backoff smoothing calculation? nltk/nltk#367
  3. http://stackoverflow.com/questions/26443084/is-there-an-alternate-for-the-now-removed-module-nltk-model-ngrammodel


ronojoy commented May 22, 2015

@yeskarthik, they have removed the n-gram model from the main branch since there are bugs in parts of the code. Faizal (#valuefromdata) and I are planning to work on this over the next several weeks to fix bugs and send in a pull request to NLTK. You are welcome to help out, if you want.

The current solution is to roll back to the older version of NLTK, avoid the Lidstone family of smoothers, and generally be careful to check that all returned probabilities are between 0 and 1. You can also talk to Faizal for additional input.
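
A hedged sketch of that probability sanity check, assuming the smoothed model is held as a `ConditionalProbDist` (as in the Witten-Bell sketch above) rather than in the removed `NgramModel` class:

```python
def check_probabilities(cpd, tol=1e-9):
    """Assert every conditional probability lies in [0, 1] and that the mass
    assigned to the *seen* continuations of each context does not exceed 1
    (smoothing keeps some mass aside for unseen words)."""
    for context in cpd.conditions():
        dist = cpd[context]
        seen_mass = 0.0
        for word in dist.samples():
            p = dist.prob(word)
            assert 0.0 <= p <= 1.0, (context, word, p)
            seen_mass += p
        assert seen_mass <= 1.0 + tol, (context, seen_mass)

check_probabilities(cpd)  # `cpd` from the smoothing sketch above
```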

The other library which has everything implemented and is generally bug-free is the SRI language modelling toolkit. It is an old-style C library, with a command-line binary and millions of switches! :) Give it a spin, if only to see how much cooler working in Python is.
