Predicting next word with Natural Language Processing

Author: Giuseppe Romagnuolo

Date: 24 April 2016

Motivation

Predicting what word comes next in a sentence is very handy when writing on portable devices that don't have a full-size keyboard.

However, the same techniques used in texting applications are common to a variety of other applications, for example:

  • genomics by segmenting DNA sequences
  • speech recognition
  • automatic language translation
  • or even, as one student in the course suggested, music sequence prediction.

And many more.

ShinyApp application

The application I built predicts up to the 10 most likely words to follow a sentence.

  • A slider widget allows the user to adjust the number of suggested words from 1 to 10 (see the sketch after this list).

  • The application was trained on 10% of the blog, news and Twitter corpora provided in the class.

  • It uses the Interpolated Modified Kneser-Ney smoothing algorithm applied to trigrams, bigrams and unigrams.
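
For illustration, here is a minimal sketch of what such a UI can look like in Shiny; the input names and the `predict_next_words()` helper are hypothetical placeholders, not the app's actual code.

```r
library(shiny)

# Sketch of a prediction UI: a text box, a 1-10 slider for the number of
# suggestions, and a table of candidate words.
ui <- fluidPage(
  textInput("phrase", "Type the beginning of a sentence:"),
  sliderInput("n", "Number of suggestions:", min = 1, max = 10, value = 3),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(input$phrase)
    # Hypothetical helper returning a data frame of candidate words and scores
    predict_next_words(input$phrase, n = input$n)
  })
}

shinyApp(ui, server)
```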

Interpolated Modified Kneser-Ney

Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discounting with a more sophisticated way to handle the backoff distribution.

The classic example (Jurafsky and Martin, 2007) is being faced with a previously unseen sentence such as:

"I can't see without my reading ____". The word "glasses" is clearly more appropriate in this context, but the word "Francisco" receives a higher probability when backing off to a unigram model estimated with MLE, since "Francisco" is more frequent overall even though it almost always follows "San".

Kneser-Ney smoothing uses a different backoff distribution from the MLE: the intuition is to base the estimate on the number of different contexts a word w has appeared in.
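
As a rough illustration, here is a minimal sketch of interpolated Kneser-Ney for bigrams with a single fixed discount. The deployed app uses the modified variant over trigrams, bigrams and unigrams; the `bigrams` data frame and the discount value below are assumptions made for the example, not the app's actual data structures.

```r
# Interpolated Kneser-Ney probability of w2 following w1, bigram case.
# Assumes `bigrams` is a data.frame with one row per distinct bigram type
# and columns w1, w2, count.
kn_bigram_prob <- function(w1, w2, bigrams, d = 0.75) {
  c_w1w2 <- sum(bigrams$count[bigrams$w1 == w1 & bigrams$w2 == w2])
  c_w1   <- sum(bigrams$count[bigrams$w1 == w1])
  if (c_w1 == 0) return(NA_real_)  # unseen history: a real model would back off further

  # Discounted maximum-likelihood term for the observed bigram
  p_ml <- max(c_w1w2 - d, 0) / c_w1

  # Back-off weight: the probability mass freed by discounting, proportional
  # to the number of distinct word types seen after w1
  lambda <- (d / c_w1) * sum(bigrams$w1 == w1)

  # Continuation probability: the number of distinct contexts w2 has appeared
  # in, relative to the total number of distinct bigram types
  p_continuation <- sum(bigrams$w2 == w2) / nrow(bigrams)

  p_ml + lambda * p_continuation
}
```

Applied to the example above, the continuation term is what favours "glasses" over "Francisco": "Francisco" may be frequent overall, but it is preceded by very few distinct word types.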

Free ShinyApp limitation

When it comes to small applications, the free Shiny server really shines.

However, it becomes a little more challenging to publish a bundle bigger than a few KB.

In fact, I could not upload 157MB of trained .RData files onto the server.

I eventually had to separate out the .RData files, upload them to an Amazon server, and modify the code so that it loaded the data from this remote location.
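
As a sketch of that change, the snippet below downloads an .RData bundle once per session and loads it into memory; the URL is a placeholder, not the real bucket location.

```r
# Fetch the pre-trained n-gram tables from remote storage on start-up.
ngram_url  <- "https://example-bucket.s3.amazonaws.com/ngrams.RData"  # placeholder URL
local_copy <- file.path(tempdir(), "ngrams.RData")

if (!file.exists(local_copy)) {
  download.file(ngram_url, destfile = local_copy, mode = "wb")
}
load(local_copy)  # restores the trigram/bigram/unigram tables into the environment
```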

Unfortunately, what previously took 5-10 seconds to initialise now takes much longer; however, once the .RData is loaded the application is responsive and fun to use.

Acknowledgements and resources

A huge thanks to my fellow students at the Johns Hopkins Data Science Specialisation. The discussions in the forum have always been inspiring and tremendously helpful.

Resources:

  • Speech and Language Processing... (Jurafsky and Martin 2007)
  • Implementation of Modified Kneser-Ney Smoothing... (Korner 2013)
  • Introduction to Information Retrieval (Manning, Raghavan, Schütze 2009)
  • Bigrams and Trigrams (Fry 2011)
  • NLP Lunch Tutorial: Smoothing (MacCartney 2005)

Source code

Licence

The MIT License (MIT)

Disclaimer

This application is the Capstone Project for the Data Science Specialisation provided by Johns Hopkins Bloomberg School of Public Health via Coursera.
