This is Project 1 for COMP90051 (Statistical Machine Learning) at the University of Melbourne.
Authorship attribution is a common task in Natural Language Processing (NLP) applications, such as academic plagiarism detection and the identification of potential terrorist suspects on social media. In a traditional author classification task, the training dataset usually includes the entire corpus of an author's published work, which contains a large number of standard sentences reflecting the author's writing style. The limited text on social media such as Twitter, however, brings challenging problems: informal expressions, a huge number of labels, an unbalanced dataset and extremely limited identity-related information.
In this project, the task is to predict the authors of test tweets from among a very large number of authors found in the training tweets; the task comes from an in-class Kaggle competition. Our work includes data preprocessing, feature engineering, model selection, model ensembling, etc. For more details, please check the project specification and project report.
The Data folder contains both the original data and the processed data.
train_tweets.txt
The original training dataset, which contains 328932 tweets posted by 9297 users.
test_tweets_unlabeled.txt
The original test dataset, which contains 35437 tweets posted by the same user group as in the training dataset.
The preprocess.py script in the Code folder transforms the original data into the processed data. For example:
all_clean_data.csv
The entire processed training dataset which contains 328932 tweets posted by 9297 users.
test_clean_data.csv
The entire processed test dataset, which contains 35437 tweets posted by the same user group as in the training dataset.
train.csv
A random 9/10 split of the processed training dataset, used as a partial training set.
train.csv
A random 1/10 split of the processed training dataset, held out as a partial test set.
preprocess.py
is used for data preprocessing, including removing non-English characters (e.g. emoticons and punctuation) and stopwords, as well as word tokenization and lemmatization based on the nltk package. It also produces some distribution plots of the data based on the matplotlib package.
Before feeding the data into the models, we use TF-IDF to transform the cleaned tweet text into a feature matrix. This step is implemented with the CountVectorizer
and TfidfTransformer
modules from the scikit-learn package.
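The two-stage vectorization can be sketched as below; the toy tweets and default parameters are illustrative, the project code may configure the vocabulary differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

tweets = [
    "love sunny day melbourne",
    "statistical machine learning assignment due",
    "love machine learning",
]

# CountVectorizer builds the raw term-count matrix ...
counts = CountVectorizer().fit_transform(tweets)
# ... and TfidfTransformer reweights counts by inverse document frequency,
# so terms common to many tweets contribute less.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # one row per tweet, one column per vocabulary term
```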
Five machine learning / deep learning models based on scikit-learn and Keras are implemented in this part: Multinomial Naive Bayes, KNN, Multiple Logistic Regression, Linear SVC and LSTM.
nb.py
- Multinomial Naive Bayes Model.
knn1.py
- KNN Model.
mlr.py
- Multiple Logistic Regression Model.
svc.py
- Linear Support Vector Classifier Model.
lstm.py
- LSTM Model.
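Each of the scikit-learn models follows the same fit/predict pattern on TF-IDF features; a hedged sketch for the Linear SVC case is shown below. The tiny dataset and user labels are invented for illustration and do not come from the project files.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned tweets and their author labels.
X_train = [
    "good morning melbourne",
    "deep learning is fun",
    "morning coffee melbourne",
    "fun with neural nets",
]
y_train = ["user_a", "user_b", "user_a", "user_b"]

# Chain vectorization and classification so test tweets are transformed
# with the vocabulary learned from the training tweets.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), LinearSVC())
model.fit(X_train, y_train)
print(model.predict(["melbourne morning"]))
```

The other scikit-learn models (MultinomialNB, KNeighborsClassifier, LogisticRegression) drop into the same pipeline in place of LinearSVC.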
ensemble.py
Ensemble learning is a powerful technique for increasing accuracy on most machine learning tasks. In this project, we try a simple ensemble approach called weighted voting to avoid overfitting and improve performance. The basic idea is simple: each model's prediction is given a weight corresponding to that model's individual accuracy in the previous stage. If the predicted labels of two models are the same, their weights are added together. We then select the prediction with the highest total weight as the final prediction.
Considering the individual performance of the previous models, we try three different combinations:
- linearSVC + MultinomialNB + KNN(K=1)
- linearSVC + MultinomialNB + MLR
- linearSVC + MultinomialNB + MLR + KNN(K=1).
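The voting rule described above can be sketched in a few lines. The weights here are illustrative stand-ins for each model's validation accuracy, not the values used in the project.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine one prediction per model, weighted by that model's accuracy."""
    score = defaultdict(float)
    for label, weight in zip(predictions, weights):
        # Models that agree on a label pool their weights.
        score[label] += weight
    # The label with the highest total weight wins.
    return max(score, key=score.get)

# Example: two models agree on user_7, outvoting the single heavier vote
# for user_3 (0.35 + 0.25 = 0.60 > 0.30).
print(weighted_vote(["user_7", "user_3", "user_7"], [0.35, 0.30, 0.25]))
```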
Due to time constraints, there are some ideas that might be worthwhile but that we have not yet tried:
- Using the SMOTE algorithm to deal with the unbalanced training dataset.
- Hyper-parameter optimization based on grid search technique.
- Adjusting the weight of the penalty term: assigning a large penalty when a prediction for a minority class is wrong and a small one when a prediction for a majority class is wrong.
- Some more complicated but powerful ensemble learning methods, which can be found here.
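The grid-search and class-weighting ideas above could be combined as sketched below; the parameter grid and toy data are assumptions for illustration, not tuned values from the report.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy stand-ins for cleaned tweets and author labels.
X = ["good morning", "deep nets", "morning run", "neural nets", "morning tea", "nets rule"]
y = ["a", "b", "a", "b", "a", "b"]

features = TfidfVectorizer().fit_transform(X)

# class_weight="balanced" penalises mistakes on minority classes more heavily,
# addressing the unbalanced-dataset point without hand-picked weights.
grid = GridSearchCV(
    LinearSVC(class_weight="balanced"),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=2,
)
grid.fit(features, y)
print(grid.best_params_)
```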