nlpred

Provides support for making predictions of binary outcomes based on natural language corpora. Formally, creates predictive models using Logistic Regression with Elastic Net Regularization on topic models derived from Latent Dirichlet Allocation. Supports multithreading for increased runtime efficiency on large data sets.
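To make the "Logistic Regression with Elastic Net Regularization" part concrete, here is a minimal sketch of the kind of objective such a model minimizes. It is not nlpred's implementation: the function names, the `alpha`/`l1_ratio` parameterization (one common convention), and the toy topic proportions are all assumptions made for illustration.

```python
import math

def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Blend of an L1 (lasso) and an L2 (ridge) penalty on the coefficients."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)

def penalized_log_loss(weights, bias, rows, labels, alpha=1.0, l1_ratio=0.5):
    """Mean logistic log-loss over all documents plus the elastic net penalty."""
    total = 0.0
    for features, label in zip(rows, labels):
        z = bias + sum(w * x for w, x in zip(weights, features))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "true"
        total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return total / len(rows) + elastic_net_penalty(weights, alpha, l1_ratio)

# Hypothetical topic proportions for four documents over three topics
topic_rows = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
labels = [1, 1, 0, 0]

loss = penalized_log_loss([1.0, 0.0, -1.0], 0.0, topic_rows, labels, alpha=0.1)
print(round(loss, 4))
```

With `l1_ratio=1.0` the penalty reduces to the lasso and with `l1_ratio=0.0` to ridge; values in between trade sparsity of the topic coefficients against shrinkage.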

A black box method, getModels(), is provided so that users who do not wish to manually tweak their models can create high-quality models by simply providing two vectors of documents (Strings): one vector of documents associated with true responses, and one vector of documents associated with false responses.

Quick Start Example

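# All of the functions used below are defined in Model.py of nlpred and must be imported first.
# Assuming the package is importable as "nlpred", the import would look like this:
from nlpred.Model import getModels, loadModel, predictFromFile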

# Declare a vector of Strings containing documents in the test/train set associated with TRUE responses
trueDocs = ["true document 1 content", "true document 2 content", "true document 3 content"]

# Declare a vector of Strings containing documents in the test/train set associated with FALSE responses
falseDocs = ["false document 1 content", "false document 2 content", "false document 3 content"]

# Call getModels() from Model.py to generate predictive models for the documents
modelDict = getModels(trueDocs, falseDocs, isRedditData = False)

# modelDict is now a Dictionary containing five candidates for an optimal model, stored as Model objects:
#   1) "AUC", the model with the highest AUC
#   2) "acc", the model with the highest accuracy
#   3) "rec", the model with the highest recall
#   4) "acc_1se", the model with the highest accuracy among models whose AUC is within 1 standard error of the maximum
#   5) "rec_1se", the model with the highest recall among models whose AUC is within 1 standard error of the maximum

The "best" model depends on the original corpus and purpose of the model, so it is left to the user to determine which of the five models returned by this method best suits their purposes.

If the user is unsure which model to use, the model with the highest AUC, stored as "AUC" in the returned Dictionary, is a reasonable default.
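The five dictionary keys can be read as five selection rules applied to a pool of fitted candidates. The sketch below illustrates those rules under stated assumptions: the candidate pool, its metric values, the `pick_candidates` helper, and the supplied standard error are all hypothetical, and how nlpred actually estimates the standard error is not shown here.

```python
def pick_candidates(candidates, auc_standard_error):
    """Apply the five selection rules to a list of candidate metric records."""
    best_auc = max(c["AUC"] for c in candidates)
    # Candidates whose AUC is within one standard error of the best AUC
    near_best = [c for c in candidates if c["AUC"] >= best_auc - auc_standard_error]
    return {
        "AUC": max(candidates, key=lambda c: c["AUC"]),
        "acc": max(candidates, key=lambda c: c["accuracy"]),
        "rec": max(candidates, key=lambda c: c["recall"]),
        "acc_1se": max(near_best, key=lambda c: c["accuracy"]),
        "rec_1se": max(near_best, key=lambda c: c["recall"]),
    }

# Hypothetical candidates, one per number of topics tried
candidates = [
    {"topicQuantity": 5, "AUC": 0.80, "accuracy": 0.90, "recall": 0.60},
    {"topicQuantity": 10, "AUC": 0.90, "accuracy": 0.82, "recall": 0.70},
    {"topicQuantity": 15, "AUC": 0.89, "accuracy": 0.85, "recall": 0.75},
    {"topicQuantity": 20, "AUC": 0.85, "accuracy": 0.80, "recall": 0.95},
]

chosen = pick_candidates(candidates, auc_standard_error=0.02)
for key, candidate in chosen.items():
    print(key, candidate["topicQuantity"])
```

The "1se" rules guard against picking a model that scores well on accuracy or recall alone but has a clearly worse AUC than the best candidate.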

Model Object Functionality

To illustrate what can be done with a Model object, assume the model with the highest AUC was chosen as optimal:

optimalModel = modelDict['AUC']

The model can be saved to disk as a .pkl file in the current working directory:

optimalModel.save(fileName = "optimal_model.pkl")

Models can be loaded manually into the Python environment from a .pkl file:

loadedModel = loadModel(fileName = "optimal_model.pkl")
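The .pkl extension suggests that saving and loading rely on Python's pickle module, although that is an assumption here and not something confirmed by this README. The following sketch shows a pickle round trip of that kind; `save_model`, `load_model`, and the dictionary standing in for a fitted model are hypothetical and are not part of nlpred.

```python
import os
import pickle
import tempfile

def save_model(model, file_name):
    """Serialize an object to disk with pickle."""
    with open(file_name, "wb") as handle:
        pickle.dump(model, handle)

def load_model(file_name):
    """Read a pickled object back from disk."""
    with open(file_name, "rb") as handle:
        return pickle.load(handle)

# A plain dictionary stands in for a fitted model in this sketch
stand_in_model = {"topicQuantity": 10, "AUC": 0.9}

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "optimal_model.pkl")
    save_model(stand_in_model, path)
    restored = load_model(path)

print(restored)
```

Unpickling can execute arbitrary code, so .pkl files should only be loaded from sources you trust.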

The model can be used to make predictions from new documents, passed as a vector of Strings:

newDocs = ["new document 1 content", "new document 2 content", "new document 3 content"]

predictionDataFrame = optimalModel.predict(listOfDocStrings = newDocs)

# predictionDataFrame is now a DataFrame containing three columns:
#   1)  'paper_text', the original input textual data
#   2)  'paper_text_processed', the processed (tokenized with punctuation and capitalization removed) textual data
#   3)  'value', the 1/0 (True/False) prediction for the 'paper_text' entry in the same row
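To give a sense of what the 'paper_text_processed' column might hold, here is a rough sketch of tokenization with punctuation and capitalization removed. The `preprocess` function is hypothetical; nlpred's actual processing may differ (for example in how it treats stop words or numbers).

```python
import re

def preprocess(text):
    """Lowercase the text, strip punctuation, and split it on whitespace."""
    lowered = text.lower()
    # Remove every character that is neither a word character nor whitespace
    no_punctuation = re.sub(r"[^\w\s]", "", lowered)
    return no_punctuation.split()

docs = ["Hello, World! It's 2024.", "Topic models: a SHORT example."]
rows = [{"paper_text": doc, "paper_text_processed": preprocess(doc)} for doc in docs]

for row in rows:
    print(row["paper_text_processed"])
```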

Alternatively, predictions can be made directly from a .pkl file without having to load the model manually:

predictionDataFrame = predictFromFile(fileName = "optimal_model.pkl", listOfDocStrings = newDocs)

A summary of the model can be printed to stdout using Python's built-in print() function:

print(optimalModel)

Alternatively, a summary of the model may be printed to stdout with the instance method print():

optimalModel.print()
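Both forms can coexist on one class when the summary text is produced by `__str__` and the instance method simply delegates to the built-in. The class below is a hypothetical stand-in written for illustration; it does not reproduce nlpred's Model class or its summary layout.

```python
class StandInModel:
    """Hypothetical stand-in for illustration; not the nlpred Model class."""

    def __init__(self, auc, accuracy, recall, topic_quantity):
        self.AUC = auc
        self.accuracy = accuracy
        self.recall = recall
        self.topicQuantity = topic_quantity

    def __str__(self):
        # print(obj) and str(obj) both call this method
        return (
            f"Topics: {self.topicQuantity}\n"
            f"AUC: {self.AUC:.3f}\n"
            f"Accuracy: {self.accuracy:.3f}\n"
            f"Recall: {self.recall:.3f}"
        )

    def print(self):
        # Inside the method body, the bare name print is still the built-in
        print(self)

model = StandInModel(auc=0.9, accuracy=0.85, recall=0.75, topic_quantity=10)
print(model)    # built-in print, which uses __str__
model.print()   # instance method, same output
```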

Instance variables of the Model object can be accessed with standard attribute (dot) notation, e.g.:

optimalAUC = optimalModel.AUC

The instance variables accessible to the user are as follows:

Variable	Description
AUC	Area under the ROC curve; a general measure of fit for the model as per the chosen predictive threshold for "true" vs. "false" from the logistic regression's predictive output.
accuracy	The classification rate, i.e. the percent of correct predictions; or, the number of correct predictions made divided by the total number of predictions made.
recall	Also known as sensitivity: P(predicted true | actually true), the proportion of true samples correctly identified as true, i.e. the number of true samples correctly identified divided by the total number of true samples. Intuitively, recall is the ability of the classifier to find all the positive samples.
confusion_matrix	The confusion matrix of the fitted model.
topicQuantity	The number of topics used in this model.
fit	The fitted logistic regression model itself, of type sklearn.linear_model.LogisticRegressionCV
topicLDAModel	The LDA object for the fitted topic model.
id	A unique identifier for the model. Note that this is a UUID object; the identifier as a plain hexadecimal string is available as "id.hex". The UUID wrapper object is kept for aesthetic reasons with regard to printing, etc.
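The accuracy and recall definitions in the table can be checked by hand against a confusion matrix. The sketch below assumes the scikit-learn layout, where rows are actual classes and columns are predicted classes, i.e. [[TN, FP], [FN, TP]]; whether nlpred stores its matrix this way is an assumption, and the counts are made up. The last lines show how a UUID object relates to its .hex string.

```python
import uuid

def accuracy_and_recall(confusion_matrix):
    """Compute accuracy and recall from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion_matrix
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total   # correct predictions / all predictions
    recall = tp / (tp + fn)        # true samples found / all true samples
    return accuracy, recall

# Hypothetical counts: 40 TN, 10 FP, 5 FN, 45 TP
matrix = [[40, 10], [5, 45]]
accuracy, recall = accuracy_and_recall(matrix)
print(accuracy, recall)

# A UUID object prints with hyphens, while .hex is the bare 32-character string
model_id = uuid.uuid4()
text_id = str(model_id)
hex_id = model_id.hex
print(text_id, hex_id)
```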

Full Documentation

For full documentation, please see documentation.pdf in the repository.

JWLevesque/nlpred
