UW-MSDS-DATA-598-Reproducibility-WI20/goel-modi-moroney-ramprasad-replication-project

Replication of Comparative Studies of Detecting Abusive Language on Twitter

This is a repository for the replication project for the Winter 2020 Data Reproducibility course in the Master of Data Science program at University of Washington.

CONTRIBUTORS

  1. Aboli Moroney

  2. Harini Ram Prasad

  3. Mayank Goel

  4. Samarth Modi

CONTENTS

A great deal of recent research has focused on identifying abusive or offensive content in online and social media. A relatively large and reliable dataset, 'Hate and Abusive Speech on Twitter', was recently released. As data scientists, we understand the need to find the best methods and data for identifying such content and flagging it as inappropriate.

In this repository, we aim to replicate some of the findings of a research paper that presents a comparative study of classifiers for hate and abusive speech on Twitter data and suggests additional features and data for improving such classification. Using the data and code provided by the authors, we aim to replicate the reported performance of the Logistic Regression model presented in the paper. The original paper compared 5 different machine learning and deep learning algorithms. For our replication, we chose the Logistic Regression model with word-level features because the authors state that it outperformed all the other machine learning techniques and achieved an F1-score equivalent to that of the best CNN model. We also had limited computational resources, so running the other machine learning and deep learning models was out of scope.
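Since the comparison above rests on the F1-score, the following minimal sketch shows how a per-class and a support-weighted F1-score can be computed in base R. The labels and predictions are toy values made up for illustration, `f1_for_class` is a hypothetical helper, and the support-weighted average is an assumption; none of this is taken from the paper's code or from the replication report.

```r
# Toy labels only: these are NOT results from the paper or from this replication.
actual    <- c("abusive", "normal", "hateful", "normal",
               "spam", "abusive", "normal", "spam")
predicted <- c("abusive", "normal", "normal", "normal",
               "spam", "normal", "normal", "abusive")

# Hypothetical helper: F1 for one label, the harmonic mean of precision and recall.
f1_for_class <- function(actual, predicted, label) {
  tp <- sum(predicted == label & actual == label)  # true positives
  fp <- sum(predicted == label & actual != label)  # false positives
  fn <- sum(predicted != label & actual == label)  # false negatives
  if (tp == 0) return(0)                           # avoids dividing by zero
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

classes <- sort(unique(actual))
per_class_f1 <- sapply(classes, function(cl) f1_for_class(actual, predicted, cl))

# Assumed averaging scheme: weight each class by its share of the true labels.
support <- as.numeric(table(factor(actual, levels = classes)))
weighted_f1 <- sum(per_class_f1 * support) / sum(support)

print(round(per_class_f1, 3))
print(round(weighted_f1, 3))
```

A weighted average keeps frequent classes from being swamped by rare ones; a macro average (the plain mean of `per_class_f1`) would treat every class equally instead.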

Citation: Lee, Y., Yoon, S., & Jung, K. (2018). Comparative studies of detecting abusive language on twitter. arXiv preprint arXiv:1808.10245.

URL: https://arxiv.org/abs/1808.10245

Git Repository: https://github.com/younggns/comparative-abusive-lang/blob/master/README.md

DATA

All data files required for our replication project are in the 'Data' directory of this repository. URL: https://github.com/UW-MSDS-DATA-598-Reproducibility-WI20/goel-modi-moroney-ramprasad-replication-project/tree/master/Data

This directory describes the original data used by the authors of the paper as well as the data that was sampled and processed for this replication study. Please refer to the README.md in that directory for additional details.

ANALYSIS

The analysis directory contains the R Markdown report detailing the procedure and results of this replication study. It also contains the intermediate outputs, R scripts, data and images required to knit the R Markdown report successfully. For additional details, please refer to the README.md in this directory. URL: https://github.com/UW-MSDS-DATA-598-Reproducibility-WI20/goel-modi-moroney-ramprasad-replication-project/tree/master/analysis
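As a rough illustration of knitting from the R console rather than from the RStudio Knit button, the sketch below finds the R Markdown files in a directory and renders each one. The helper names `find_rmd_files` and `knit_reports` are hypothetical and not part of this repository, and the sketch assumes the `rmarkdown` package is installed and that the working directory is the repository root.

```r
# Hypothetical helper: list every .Rmd file in a directory.
find_rmd_files <- function(dir) {
  list.files(dir, pattern = "\\.Rmd$", full.names = TRUE, ignore.case = TRUE)
}

# Hypothetical helper: render each report, returning the output file paths.
knit_reports <- function(dir = "analysis") {
  reports <- find_rmd_files(dir)
  if (length(reports) == 0) {
    message("No .Rmd files found in: ", dir)
    return(invisible(character(0)))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Package 'rmarkdown' is not installed; nothing was rendered.")
    return(invisible(character(0)))
  }
  # rmarkdown::render() returns the path of the output file it wrote.
  outputs <- vapply(
    reports,
    function(f) rmarkdown::render(f, quiet = TRUE),
    character(1)
  )
  invisible(outputs)
}

# Run from the repository root so the relative path resolves.
knit_reports("analysis")
```

Rendering every `.Rmd` file found is a simplification; if the directory holds child documents that are only meant to be included by a main report, pass the main file to `rmarkdown::render()` directly.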

DEPENDENCIES

OS type and version: Windows 10 Pro, Version 1903, OS build 18362.535

System type: 64-bit OS, x64-based processor

R version: >=3.6.2

R packages and versions:

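| R Package | Version |
|-----------|---------|
| caret     | 6.0-84  |
| future    | 1.16.0  |
| tm        | 0.7-7   |
| quanteda  | 1.5.2   |
| LiblineaR | 2.10-8  |
| stringr   | 1.4.0   |
| here      | 0.1     |
| ggplot2   | 3.2.1   |
| wordcloud | 2.6     |
| bookdown  | 0.17    |
| dplyr     | 0.8.3   |
| knitr     | 1.28    |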
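To compare a local R library against the versions listed above, something like the following sketch may help. `check_versions` is a hypothetical helper, not part of this repository. It uses the case-sensitive CRAN names `caret` and `LiblineaR`, and reports a package as matching only when the installed version equals the listed one exactly.

```r
# Versions copied from the dependency list above.
required <- c(
  caret = "6.0-84", future = "1.16.0", tm = "0.7-7", quanteda = "1.5.2",
  LiblineaR = "2.10-8", stringr = "1.4.0", here = "0.1", ggplot2 = "3.2.1",
  wordcloud = "2.6", bookdown = "0.17", dplyr = "0.8.3", knitr = "1.28"
)

# Hypothetical helper: compare listed versions with what is installed locally.
check_versions <- function(required) {
  installed <- vapply(names(required), function(pkg) {
    # packageVersion() raises an error for missing packages; map that to NA.
    tryCatch(as.character(utils::packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1))
  matches <- vapply(seq_along(required), function(i) {
    # package_version() parses "6.0-84" and "6.0.84" to the same version.
    !is.na(installed[[i]]) &&
      package_version(installed[[i]]) == package_version(required[[i]])
  }, logical(1))
  data.frame(
    package = names(required),
    required = unname(required),
    installed = unname(installed),
    matches = matches,
    stringsAsFactors = FALSE
  )
}

print(check_versions(required))
```

A row with `installed` set to `NA` means the package is missing; a `FALSE` in `matches` with a version present means a different release is installed, which may or may not affect the results.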

LICENSE

This project is licensed under the MIT License. Please read our license details.

Text and Figures: MIT + file LICENSE

Code: MIT + file LICENSE

Data: MIT + file LICENSE

CONTRIBUTING

We welcome contributions from everyone. If you would like to make a contribution, please read our contributor guidelines. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
