This project aims to improve the quality of online discussions by detecting toxic content. As a case study, it analyzes questions asked on Quora, a platform known for its diverse user-generated content. The goal is to develop a predictive model that accurately labels questions as either sincere or insincere.
The dataset used in this project is the Quora Insincere Questions Classification dataset, which can be accessed on Kaggle. I use a fraction of the Quora Insincere Questions dataset (train.csv) due to the extensive pre-training of the text classification models. However, you can choose to use the entire dataset if needed.
Insincere questions are those that exhibit characteristics such as a non-neutral tone, disparaging language, the inclusion of inflammatory content based on false information, or the presence of sexually explicit content intended to shock or provoke. By identifying and flagging these insincere questions, we can contribute to creating a healthier and more respectful online environment.
There is a noticeable class imbalance issue in this dataset. The majority of the questions are labeled as sincere or non-toxic, while the number of insincere questions is comparatively smaller.
To address this class imbalance problem, various strategies can be employed, such as under-sampling the majority class or over-sampling the minority class using different algorithms or techniques. However, for the scope of this project, I have decided not to specifically address this problem within this notebook.
Instead, I use a stratified sampling strategy. This approach assumes that the class imbalance observed in the dataset reflects the real-world distribution. When creating the training and validation splits, I therefore sample so that both sets keep the same proportions of sincere and insincere questions as the full dataset. This keeps the validation metrics representative of the data the model is expected to see.
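As a minimal sketch of what a stratified split does, the example below splits a toy label list so that each class contributes the same fraction to the validation set. The helper name stratified_split, the 80/20 split, and the 94/6 class ratio are illustrative assumptions, not values taken from the notebook. In practice, scikit-learn's train_test_split with its stratify argument is a common way to get the same effect.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=42):
    """Return (train_idx, valid_idx) that preserve per-class proportions.

    Hypothetical helper for illustration; not the notebook's own code.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then carve off the same fraction.
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


def positive_rate(labels, indices):
    """Fraction of the selected rows labeled 1 (insincere)."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels: 0 = sincere, 1 = insincere. The ratio is an assumption.
labels = [0] * 940 + [1] * 60
train_idx, valid_idx = stratified_split(labels)

print("train insincere rate:", positive_rate(labels, train_idx))
print("valid insincere rate:", positive_rate(labels, valid_idx))
```

Both printed rates match the rate in the full toy list, which is the property the stratified split is meant to guarantee.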
This project draws inspiration from the Coursera project Transfer Learning for NLP with TensorFlow Hub, with modifications to enhance visualization and conceptual understanding.
- I utilize pre-trained models from TensorFlow Hub with tf.keras for text classification.
- Transfer learning enables fine-tuning models on text data, saving training resources and achieving good model generalization.
- Model performance metrics are visualized using TensorBoard.
Transfer learning leverages shared knowledge about language across NLP tasks, improving model performance and efficiency. This works because many NLP tasks share common linguistic representations and structural similarities in language. What a model learns on one task can therefore inform and benefit another, making transfer learning a powerful approach in NLP.
This dataset comprises questions paired with corresponding labels. To train the statistical classification model effectively, I use question vectors as distributed representations of the questions. These question vectors, along with their corresponding labels, are employed during training to build and fine-tune the model.
- Check GPU Availability
- Importing Necessary Libraries
- Download and Import Dataset
- Text Embedding Explanations and TensorFlow Hub
- Define Function to Build and Compile Models
- Train Various Text Classification Models (without fine-tuning)
- Train Various Text Classification Models (with fine-tuning)
- Compare Accuracy and Loss Curves
- Visualize Metrics with TensorBoard
This project demonstrates:
- Usage of pre-trained NLP text embedding models from TensorFlow Hub.
- Transfer learning and fine-tuning on real-world text data.
- Visualization of model performance metrics using Matplotlib, TensorFlow documentation package, and TensorBoard.
- Quora Insincere Questions Classification Dataset:
  - Authors: Alex Ellis, inversion, Julia Elliott, Paula Griffin, William Chen
  - Title: Quora Insincere Questions Classification
  - Publisher: Kaggle
  - Year: 2018
  - URL: Quora Insincere Questions Classification
- Coursera Project:
  - Title: Transfer Learning for NLP with TensorFlow Hub
  - Publisher: Coursera
  - URL: Transfer Learning for NLP with TensorFlow Hub