GitHub - NewGround-LLC/psistats: Applying Deep Machine Learning for psycho-demographic profiling of Internet users using O.C.E.A.N. model of personality.

This repository contains the source code accompanying the research paper "Applying Deep Machine Learning for psycho-demographic profiling of Internet users using O.C.E.A.N. model of personality" and the arXiv preprint arXiv:1703.06914.

Setting up working environment

Dependencies

The source code in this repository is written in the R programming language and uses the TensorFlow framework to accelerate the graph computations performed by the studied machine learning models. For better performance, it is recommended to install TensorFlow with GPU support.

Installation instructions for the TensorFlow library can be found at this page. Installation instructions for the R programming language environment are given in the 'R Installation and Administration' section of the manuals web page.

To access the TensorFlow framework from the R environment, it is necessary to install the TensorFlow for R package as described at this page.

The R source code depends on several third-party R packages:

irlba - the package to perform singular value decomposition analysis
optparse - the package providing support for command line arguments parsing
Matrix - the package for handling sparse matrix structures
mice - the package to perform multivariate imputation of missing values
R6 - the package for defining R6 classes (R6Class) with reference semantics
ROCR - the package for visualizing the performance of scoring classifiers
tensorflow - the package exposing TensorFlow Python API in R environment

To install the necessary R packages, run the following command in the R environment:

> install.packages(c("irlba", "optparse", "Matrix", "mice", "R6", "ROCR"))
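Before installing, it can help to see which of the packages listed above are still missing. The following is a minimal sketch using only base R; `missing_packages` is a hypothetical helper introduced for illustration and is not part of this repository.

```r
# Packages listed in this README (the tensorflow bridge is installed separately).
required <- c("irlba", "optparse", "Matrix", "mice", "R6", "ROCR")

# Hypothetical helper: returns the names of packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install only what is missing
}
```

Installing only the missing packages avoids re-downloading ones that are already present.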

The tensorflow bridge package should be installed as described in this manual.

Data Corpus

The data corpus used in the research is publicly available and can be requested at dataminingtutorial.com.

The data corpus comprises the following files:

users.csv: contains psycho-demographic user profiles. It has 110 728 rows (excluding the row holding column names) and nine columns: anonymized user ID, gender (“0” for male and “1” for female), age, political views (“0” for Democrat and “1” for Republican), and five scores from the five-factor model of personality (Goldberg et al., 2006).
likes.csv: contains anonymized IDs and names of 1 580 284 Facebook Likes. It has two columns: ID and name.
users-likes.csv: contains the associations between users and their Likes, stored as user–Like pairs. It has 10 612 326 rows and two columns: user ID and Like ID. The existence of a user–Like pair implies that the given user had the corresponding Like on their profile.
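To illustrate how user–Like pairs map onto a sparse matrix, here is a small sketch using the Matrix package on toy data. The column names `userid` and `likeid` are assumptions made for this example (the actual header names in the corpus may differ), and the toy data frames merely stand in for the real CSV files.

```r
library(Matrix)

# Toy stand-ins for users.csv, likes.csv and users-likes.csv.
# Column names (userid, likeid) are assumed, not taken from the corpus.
users <- data.frame(userid = c("u1", "u2", "u3"), stringsAsFactors = FALSE)
likes <- data.frame(likeid = c("l1", "l2", "l3", "l4"), stringsAsFactors = FALSE)
users_likes <- data.frame(
  userid = c("u1", "u1", "u2", "u3", "u3"),
  likeid = c("l1", "l2", "l2", "l3", "l4"),
  stringsAsFactors = FALSE
)

# Translate each user-Like pair into a (row, column) position.
i <- match(users_likes$userid, users$userid)
j <- match(users_likes$likeid, likes$likeid)

# One row per user, one column per Like; a 1 marks an existing pair.
M <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(nrow(users), nrow(likes)),
                  dimnames = list(users$userid, likes$likeid))
print(M)
```

A sparse representation matters here because, with roughly 10.6 million pairs spread over about 110 thousand users and 1.58 million Likes, almost all entries of the full matrix are zero.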

Source code structure

The source code consists of R scripts, each encapsulating a particular piece of functionality:

config.R - holds common configuration parameters (input, intermediate and output directories, etc.)
preprocessing.R - preprocesses the raw data corpus by creating a sparse data matrix, trimming it, and imputing missing data points
analysis.R - encapsulates routines for preliminary analysis of the data corpus: finding correlations between input variables and outputs (heat map), and finding the optimal number of SVD dimensions (plotting the number of SVD dimensions against the prediction accuracy of regression models per dependent variable)
svd_varimax.R - reduces the dimensionality of the input features using SVD with subsequent varimax rotation in order to simplify the SVD dimensions
users_likes_data_set.R - holds the data set definition with functions to get batches of train/validation samples
utils.R - provides common utilities
regression_analysis.R - encapsulates the experiment with linear/logistic regression predictive models
nn_analysis.R - encapsulates the experiment with predictive models based on artificial neural networks
mlp.R - encapsulates shallow neural network graph creation
dnn.R, 3dnn.R - encapsulate deep neural network (DNN) graph creation with two and three hidden layers respectively
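This README does not spell out how preprocessing.R trims the matrix. One common approach, sketched below under that caveat, is to repeatedly drop users with too few Likes and Likes with too few users until nothing more is removed. The helper `trim_matrix` and its threshold parameters `min_likes` and `min_users` are hypothetical and chosen only for illustration.

```r
library(Matrix)

# Hypothetical helper: repeatedly drop users with fewer than min_likes Likes
# and Likes with fewer than min_users users, until the matrix stops changing.
trim_matrix <- function(M, min_likes = 2, min_users = 2) {
  repeat {
    keep_rows <- rowSums(M) >= min_likes
    keep_cols <- colSums(M) >= min_users
    if (all(keep_rows) && all(keep_cols)) break
    M <- M[keep_rows, keep_cols, drop = FALSE]
  }
  M
}

# Toy users-likes matrix: 4 users (rows) by 4 Likes (columns).
M <- sparseMatrix(i = c(1, 1, 2, 2, 3, 3, 4),
                  j = c(1, 2, 1, 2, 2, 3, 4),
                  x = 1,
                  dims = c(4, 4),
                  dimnames = list(paste0("u", 1:4), paste0("l", 1:4)))

trimmed <- trim_matrix(M)
print(dim(trimmed))
```

The loop is needed because removing a rare Like can push a user below the threshold, and vice versa, so a single pass is not always enough.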

Additionally, shell scripts are provided to help with running the R scripts:

eval_mlp_1.sh - to evaluate shallow neural network
eval_dnn.sh - to evaluate DNN with two hidden layers
eval_3dnn.sh - to evaluate DNN with three hidden layers

Running experiments

Detailed instructions on how to run the experiments are presented in our research paper. Here we outline only the major steps:

The trimmed sparse matrix of users-likes relations must be created using the preprocessing.R script.
The optimal number of SVD dimensions to apply to the created users-likes matrix should be found by executing the analysis.R script.
With the optimal number of SVD dimensions found, the dimensionality reduction should be performed using the svd_varimax.R script.
The linear/logistic regression analysis can be performed with the regression_analysis.R script, using as input the users-likes matrix with reduced feature dimensions prepared in the previous step.
The experiments with predictive models based on neural networks can be executed by running the corresponding shell scripts mentioned above.
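The dimensionality reduction step (SVD followed by varimax rotation) can be sketched on toy data as follows. This is not the code from svd_varimax.R: it uses base R `svd()` in place of irlba so that it runs without extra packages, the value `k = 3` is arbitrary, and the random matrix only stands in for the trimmed users-likes matrix.

```r
set.seed(42)

# Toy dense stand-in for the trimmed users-likes matrix (20 users x 8 Likes).
M <- matrix(rbinom(20 * 8, size = 1, prob = 0.4), nrow = 20, ncol = 8)
M[1, ] <- 1  # make sure every Like has at least one user, as trimming would

k <- 3  # number of SVD dimensions; an arbitrary value for illustration

# Truncated SVD via base R. For a large sparse matrix, irlba::irlba(M, nv = k)
# computes the leading k components without a full decomposition.
s <- svd(M, nu = k, nv = k)

# Varimax-rotate the Like loadings, then project users onto the rotated axes.
rot <- varimax(s$v)
v_rot <- unclass(rot$loadings)  # 8 x k matrix of rotated Like loadings
u_rot <- M %*% v_rot            # 20 x k matrix of user scores

print(dim(u_rot))
```

The rows of `u_rot` play the role of the reduced feature vectors that the regression and neural network experiments take as input.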

Authors

This source code is maintained and managed by Iaroslav Omelianenko (NewGround LLC).
