BullyDetect (Techniques to Detect Cyberbully)

This is my final year project at Multimedia University, Cyberjaya. It uses Natural Language Processing to detect cyberbullying in text comments, combining supervised and unsupervised learning.

Supervised Learning

The dataset used for classification is from Kaggle and the following supervised machine learning algorithms were used:

Random Forest (100 Trees)
Naive Bayes (Gaussian Model)
Support Vector Machines (Linear SVC)
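
As a rough illustration of how the three classifiers listed above can be set up in Scikit-learn, here is a minimal sketch. The feature matrix and labels are synthetic stand-ins (an assumption for this example, not the Kaggle data), and apart from the 100 trees every hyperparameter is left at the library default, which may differ from what the project used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for comment feature vectors: 200 samples, 10 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical "bully" label

# Simple hold-out split for the illustration.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": GaussianNB(),
    "Linear SVC": LinearSVC(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the held-out part
    print(f"{name}: accuracy = {scores[name]:.3f}")
```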

KAGGLE DATASET LINK

EXTRA: Now that the main phases are over, the following approaches are being tried out:

Fine-tuning the parameters of the machine learning algorithms
Trying out XGBoost

Unsupervised Learning

The framework used is the Word2Vec Skip-Gram model. The model was trained on comments from the Reddit corpus, from January 2015 to May 2015. K-Means Clustering was also used in conjunction with Word2Vec. The skip-gram model is shown below:

REDDIT CORPUS LIST
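
The skip-gram objective trains a model to predict the words surrounding a given center word. The sketch below only shows how the (center, context) training pairs are formed from a tokenized comment; it does not train anything. The function name `skipgram_pairs` and the window size are hypothetical choices for this example, and Gensim handles this step internally.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


comment = "you are such a loser".split()
pairs = skipgram_pairs(comment, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window produces more pairs per word and tends to capture broader topical context, while a smaller window emphasizes immediate neighbours.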

Methods used

Some of the main methods used are:

Average Words: The most basic approach. Add the feature vectors of words, then divide by the total number of words.
Mean Similarity: Working word by word, find the top-n most similar words and average their cosine similarities, then use the feature vectors of the words whose similarity is above that mean.
Word Feature: Using the mean feature of each specific word, provided it is in the model.
Clustering Word Vectors: Using K-Means Clustering to cluster a group of words together.
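
To make two of these methods concrete, here is a small sketch of Average Words and Clustering Word Vectors. The 3-dimensional vectors are made up for the example (a real run would take them from the trained Word2Vec model), and the helper name `average_words` is hypothetical. It is also an assumption here that out-of-vocabulary words are skipped and the sum is divided by the number of words actually found.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical word vectors; a real model would supply these.
word_vectors = {
    "idiot": np.array([1.0, 0.9, 0.0]),
    "loser": np.array([0.9, 1.0, 0.1]),
    "nice": np.array([0.0, 0.1, 1.0]),
    "great": np.array([0.1, 0.0, 0.9]),
}


def average_words(tokens, vectors, dim=3):
    """Sum the vectors of in-vocabulary tokens and divide by how many were found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    return np.sum(found, axis=0) / len(found)


comment_vector = average_words(["idiot", "loser", "unknown"], word_vectors)

# Cluster the word vectors so that similar words share a cluster id.
words = list(word_vectors)
matrix = np.vstack([word_vectors[w] for w in words])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(matrix)
word_to_cluster = dict(zip(words, kmeans.labels_))
print(comment_vector, word_to_cluster)
```

The cluster ids can then serve as features, for example by counting how many words of a comment fall into each cluster.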

Some of the above methods can be combined with TF-IDF weights computed from the Kaggle dataset.
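
One plausible way to combine word averaging with TF-IDF is to weight each word vector by its IDF score before averaging, so that rarer words contribute more. The sketch below assumes that scheme; the repository's own scripts may combine them differently. The toy comments, the 2-dimensional vectors, and the helper name `tfidf_weighted_average` are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Kaggle comments.
comments = ["you are a loser", "you are great", "what a great day"]
vectorizer = TfidfVectorizer()
vectorizer.fit(comments)

# Map each vocabulary word to its learned IDF weight.
idf = {word: vectorizer.idf_[idx] for word, idx in vectorizer.vocabulary_.items()}

# Hypothetical 2-dimensional word vectors.
word_vectors = {"you": np.array([1.0, 0.0]), "loser": np.array([0.0, 1.0])}


def tfidf_weighted_average(tokens, vectors, idf_weights, dim=2):
    """Average word vectors, weighting each one by the word's IDF score."""
    total = np.zeros(dim)
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in idf_weights:
            total += idf_weights[token] * vectors[token]
            weight_sum += idf_weights[token]
    return total / weight_sum if weight_sum > 0 else np.zeros(dim)


weighted = tfidf_weighted_average(["you", "loser"], word_vectors, idf)
print(weighted)
```

Because "loser" appears in fewer comments than "you", it receives the larger weight and pulls the averaged vector towards its own direction.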

Evaluation and Results

The following evaluation metrics were computed using stratified 10-fold cross-validation: Accuracy, Precision, False Positive Rate (FPR), Area Under ROC, Log Loss, Brier Score Loss and Run-Time Prediction. Because the dataset is skewed towards the negative class (about 75% non-bully comments), particular importance was placed on Precision, FPR, Brier Score Loss, and Run-Time Prediction. The results are divided into two Jupyter notebooks, based on two different datasets:

Balanced Dataset: Using an even number of bully and non-bully comments
Imbalanced Dataset: Using the full dataset
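
The sketch below shows one way to compute three of the emphasized metrics (Precision, FPR and Brier Score Loss) under stratified 10-fold cross-validation with Scikit-learn. The data is synthetic with a 75/25 class split to mimic the skew described above, and Gaussian Naive Bayes is used only because it exposes class probabilities; none of this is taken from the project's `evaluation.py`.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix, precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 150 non-bully (0) and 50 bully (1) samples.
rng = np.random.RandomState(0)
y = np.array([0] * 150 + [1] * 50)
X = rng.normal(size=(200, 5)) + 1.5 * y[:, None]  # shift the positive class

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
precisions, fprs, briers = [], [], []

for train_idx, test_idx in skf.split(X, y):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]  # probability of the bully class

    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
    fprs.append(fp / (fp + tn))  # FPR = FP / (FP + TN)
    precisions.append(precision_score(y[test_idx], pred, zero_division=0))
    briers.append(brier_score_loss(y[test_idx], prob))

print(f"Precision: {np.mean(precisions):.3f}")
print(f"FPR:       {np.mean(fprs):.3f}")
print(f"Brier:     {np.mean(briers):.3f}")
```

Stratification keeps the 75/25 ratio in every fold, so each test fold contains both classes and the FPR denominator is never zero.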

Also, the evaluation of Word2Vec can be found here.

Tools Used

Python 3.5+ was used as the scripting language, while MongoDB was used to store the comments from Reddit. Some of the main libraries used:

Gensim: For Word2Vec.
Scikit-learn: For Machine Learning and Evaluation Metrics.
Regex: For handling character-level expressions in text.

tazeek/BullyDetect
