
COMP90051 Project 1: Authorship Attribution with Limited Text on Twitter

This is Project 1 for COMP90051 (Statistical Machine Learning) at the University of Melbourne.

1. What is the task?

Authorship attribution is a common task in Natural Language Processing (NLP) applications, such as academic plagiarism detection and the identification of potential terrorist suspects on social media. In a traditional author classification task, the training dataset usually covers the entire corpus of an author's published work, which contains a large number of examples of standard sentences reflecting the author's writing style. However, the limited text on social media like Twitter brings some challenging problems, such as informal expressions, a huge number of labels, an unbalanced dataset, and extremely limited identity-related information.


In this project, the task is to predict the authors of test tweets from among a very large number of authors found in the training tweets; the task comes from an in-class Kaggle competition. Our work includes data preprocessing, feature engineering, model selection, ensemble models, etc. For more details, please check the project specifications and the project report.

2. Data

The Data folder contains both original data and processed data.

2.1. Original Data

train_tweets.txt

The original training dataset, which contains 328932 tweets posted by 9297 users.


test_tweets_unlabeled.txt

The original test dataset, which contains 35437 tweets posted by the same group of users as the training dataset.


2.2. Processed Data

preprocess.py in the Code folder transforms the original data into the processed files listed below:


all_clean_data.csv

The entire processed training dataset, which contains 328932 tweets posted by 9297 users.

test_clean_data.csv

The entire processed test dataset, which contains 35437 tweets posted by the same group of users as the training dataset.

train.csv

A random 9/10 split of the processed training dataset, used as a partial training set.

test.csv

The remaining random 1/10 split of the processed training dataset, used as a partial test set.

3. Code

3.1. Data Preprocessing and Feature Engineering

preprocess.py

is used for data preprocessing, including removing non-English characters (e.g. emoticons and punctuation) and stopwords, as well as word tokenization and lemmatization based on the nltk package. It also provides some distribution plots for the data via the matplotlib package.
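A minimal sketch of this kind of cleaning step, assuming nltk's standard English stopword list and WordNet lemmatizer (the function name and regex below are illustrative, not the repository's exact code):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
STOPWORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_tweet(text):
    """Lowercase, strip non-English characters, remove stopwords, lemmatize."""
    text = re.sub(r'[^a-z\s]', ' ', text.lower())  # drop emoticons/punctuation/digits
    tokens = word_tokenize(text)                   # nltk word tokenization
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOPWORDS]
    return ' '.join(tokens)

print(clean_tweet("@user Loving these results!!! :)"))  # -> "user loving result"
```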


Before feeding the data into the models, we use TF-IDF to transform the cleaned tweet text into a feature matrix. This step is implemented with the CountVectorizer and TfidfTransformer modules from the scikit-learn package.
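Roughly, assuming the cleaned tweets are available as a list of strings (the toy examples below stand in for the real data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Cleaned tweets, e.g. as loaded from all_clean_data.csv (toy examples here)
texts = ["user loving result", "great game last night", "new model great result"]

count_vec = CountVectorizer()            # tokens -> raw term counts
counts = count_vec.fit_transform(texts)  # sparse document-term matrix
tfidf = TfidfTransformer()
X = tfidf.fit_transform(counts)          # reweight by inverse document frequency

print(X.shape)  # (number of tweets, vocabulary size)
```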

3.2. Model Selection

Five machine learning / deep learning models based on scikit-learn and Keras are implemented in this part: Multinomial Naive Bayes, KNN, Multiple Logistic Regression, Linear SVC and LSTM. A minimal training sketch follows the file list below.

  • nb.py - Multinomial Naive Bayes Model.
  • knn1.py - KNN Model.
  • mlr.py - Multiple Logistic Regression Model.
  • svc.py - Linear Support Vector Classifier Model.
  • lstm.py - LSTM Model.
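The scikit-learn models above all follow the same fit/predict pattern on top of the TF-IDF features. A sketch for the Multinomial Naive Bayes case, with toy data standing in for the real tweets and author IDs (not the repository's exact code):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for cleaned tweets and their author IDs
train_texts = ["user loving result", "great game last night", "new model great result"]
train_authors = [101, 202, 101]

# Chain vectorization, TF-IDF weighting and the classifier into one estimator
model = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
model.fit(train_texts, train_authors)
print(model.predict(["great game result"]))  # predicted author ID
```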

3.3. Ensemble Learning

ensemble.py

Ensemble learning is a powerful technique for increasing accuracy on most machine learning tasks. In this project, we try a simple ensemble approach called weighted voting to avoid overfitting and improve performance. The basic idea is simple: each model's prediction is given a weight corresponding to that model's individual accuracy in the previous stage. If the predicted labels of two models are the same, their weights are added together, and the prediction with the highest total weight is selected as the final prediction.
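A minimal sketch of this weighted-voting rule (the weights and predictions below are made-up illustrations, not the accuracies reported in the project):

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Pick the label whose supporting models have the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight  # agreeing models pool their weights
    return max(scores, key=scores.get)

# e.g. predictions from linearSVC, MultinomialNB and KNN(K=1) for one tweet,
# weighted by each model's individual accuracy from the previous stage
print(weighted_vote([42, 42, 7], [0.35, 0.30, 0.40]))  # -> 42 (0.65 > 0.40)
```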

Considering the individual performance of the previous models, we try three different combinations:

  • linearSVC + MultinomialNB + KNN(K=1)
  • linearSVC + MultinomialNB + MLR
  • linearSVC + MultinomialNB + MLR + KNN(K=1)

4. Future Work

Due to time constraints, there are some ideas that might be worthwhile but that we have not yet tried:

  • Using the SMOTE algorithm to deal with the unbalanced training dataset.
  • Hyper-parameter optimization based on the grid search technique (see the sketch after this list).
  • Adjusting the weight of the penalty term: assigning a large value when a prediction for a minority class is wrong and a small value when a prediction for a majority class is wrong.
  • Some more complicated but powerful ensemble learning methods, which can be found here.
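For instance, grid search over a TF-IDF + Linear SVC pipeline could look like the sketch below (the parameter grid and toy data are illustrative, not tuned values from the report):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),  # CountVectorizer + TfidfTransformer in one step
    ('clf', LinearSVC()),
])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    'clf__C': [0.1, 1.0, 10.0],              # inverse regularization strength
}
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)

# Toy stand-ins for cleaned tweets and author IDs (3 examples per class for cv=3)
texts = ["user loving result", "great game last night", "new model great result",
         "match score goal", "goal keeper save", "final match tonight"]
authors = [101, 202, 101, 202, 202, 101]

search.fit(texts, authors)
print(search.best_params_)
```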
