Transfer-Learning-for-NLP-with-TensorFlow-Hub

Project Overview

This project aims to improve the quality of online discussions by detecting toxic content. As a case study, it analyzes questions asked on Quora, a platform known for its diverse user-generated content. The goal is to develop a predictive model that accurately labels questions as either sincere or insincere.

Dataset Information

The dataset used in this project is the Quora Insincere Questions Classification dataset, which can be accessed on Kaggle. Because the text embedding models are already extensively pre-trained, I train on only a fraction of the training file (train.csv). You can use the entire dataset if needed.

Insincere questions are those that exhibit characteristics such as a non-neutral tone, disparaging language, the inclusion of inflammatory content based on false information, or the presence of sexually explicit content intended to shock or provoke. By identifying and flagging these insincere questions, we can contribute to creating a healthier and more respectful online environment.

Class Imbalance Issue

There is a noticeable class imbalance issue in this dataset. The majority of the questions are labeled as sincere or non-toxic, while the number of insincere questions is comparatively smaller.

To address this class imbalance problem, various strategies can be employed, such as under-sampling the majority class or over-sampling the minority class using different algorithms or techniques. However, for the scope of this project, I have decided not to specifically address this problem within this notebook.
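For readers unfamiliar with the two resampling strategies mentioned above, here is a minimal standard-library sketch of random under-sampling and random over-sampling. It is not part of this project's notebook. The function names `undersample_majority` and `oversample_minority`, the toy rows, and the 10% insincere rate are all made up for illustration; the `question_text` and `target` keys are assumed to mirror the columns of the Kaggle train.csv.

```python
import random
from collections import defaultdict


def group_by_label(rows, label_key="target"):
    """Collect rows into one list per class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    return groups


def undersample_majority(rows, label_key="target", seed=42):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_key)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def oversample_minority(rows, label_key="target", seed=42):
    """Randomly duplicate rows until every class is as large as the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_key)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=largest - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 90 sincere (0) and 10 insincere (1) questions.
rows = [
    {"question_text": f"question {i}", "target": 1 if i % 10 == 0 else 0}
    for i in range(100)
]
under = undersample_majority(rows)
over = oversample_minority(rows)
print(len(under), len(over))
```

Under-sampling discards data from the majority class, while over-sampling repeats minority examples, so each changes the class distribution the model sees during training.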

Instead of resampling, I use a stratified sampling strategy. This approach assumes that the class imbalance observed in the dataset reflects the real-world distribution. Therefore, when creating the training and validation splits, I sample the data so that both sets keep the same proportions of sincere and insincere questions as the full dataset. As a result, validation metrics are measured on the same class distribution the model was trained on.
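The idea behind a stratified split can be sketched with the standard library alone: split each class separately, then recombine. This is an illustration, not necessarily the code used in the notebook. The function names `stratified_split` and `positive_rate`, the 20% validation fraction, and the toy rows are assumptions chosen for the example. In practice, libraries such as scikit-learn provide the same behavior through the `stratify` argument of `train_test_split`.

```python
import random
from collections import defaultdict


def stratified_split(rows, valid_fraction=0.2, label_key="target", seed=42):
    """Split rows into train/validation sets, preserving each class's share."""
    rng = random.Random(seed)

    # Group rows by class so each class can be split on its own.
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    train, valid = [], []
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_valid = round(len(shuffled) * valid_fraction)
        valid.extend(shuffled[:n_valid])
        train.extend(shuffled[n_valid:])

    # Shuffle again so the classes are interleaved within each split.
    rng.shuffle(train)
    rng.shuffle(valid)
    return train, valid


def positive_rate(rows, label_key="target"):
    """Fraction of rows labeled insincere (label 1)."""
    return sum(row[label_key] for row in rows) / len(rows)


# Hypothetical toy data: 950 sincere (0) and 50 insincere (1) questions.
rows = [
    {"question_text": f"question {i}", "target": 1 if i % 20 == 0 else 0}
    for i in range(1000)
]
train_rows, valid_rows = stratified_split(rows, valid_fraction=0.2)
print(positive_rate(rows), positive_rate(train_rows), positive_rate(valid_rows))
```

On this toy data the insincere rate is the same in the full set, the training split, and the validation split, which is the property the stratified strategy is meant to guarantee.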

Inspiration

This project draws inspiration from the Coursera project Transfer Learning for NLP with TensorFlow Hub, with modifications to enhance visualization and conceptual understanding.

Methodology

I utilize pre-trained models from TensorFlow Hub with tf.keras for text classification.
Transfer learning lets me fine-tune these pre-trained models on the Quora text, which saves training resources and helps the model generalize.
Model performance metrics are visualized using TensorBoard.

Why Transfer Learning in NLP?

Transfer learning reuses knowledge about language that is shared across NLP tasks, improving model performance and efficiency. Many NLP tasks rely on common linguistic representations and structural regularities, so what a model learns on one task can inform and benefit another. This makes transfer learning a powerful approach in NLP.

Text Data and Representations

This dataset comprises questions paired with corresponding labels. To train the statistical classification model effectively, I use question vectors as distributed representations of the questions. These question vectors, along with their corresponding labels, are employed during training to build and fine-tune the model.

Table of Contents

Check GPU Availability
Import Necessary Libraries
Download and Import Dataset
Text Embedding Explanations and TensorFlow Hub
Define Function to Build and Compile Models
Train Various Text Classification Models (without fine-tuning)
Train Various Text Classification Models (with fine-tuning)
Compare Accuracy and Loss Curves
Visualize Metrics with TensorBoard

Recap

This project demonstrates:

Use of pre-trained NLP text embedding models from TensorFlow Hub.
Transfer learning and fine-tuning on real-world text data.
Visualization of model performance metrics using Matplotlib, the TensorFlow documentation package, and TensorBoard.

Results

References

Quora Insincere Questions Classification Dataset:
- Authors: Alex Ellis, inversion, Julia Elliott, Paula Griffin, William Chen
- Title: Quora Insincere Questions Classification
- Publisher: Kaggle
- Year: 2018
- URL: Quora Insincere Questions Classification
Coursera Project:
- Title: Transfer Learning for NLP with TensorFlow Hub
- Publisher: Coursera
- URL: Transfer Learning for NLP with TensorFlow Hub

Nilabbasi/Transfer-Learning-for-NLP-with-TensorFlow-Hub
