A Recurrent Neural Pipeline for Multi-Class | Multi-Label Text Classification

This repository walks you through an end-to-end flow of training Sequence Models (RNNs) along with all the tips/tricks/pointers of which a developer should be aware.

Here, we explain how to frame and handle a Multi-Class | Multi-Label Text Classification problem statement, along with its data preparation pipeline.

Quick Colab Setup

Follow the steps below to set up a Google Colab workspace and run your experiments:

  1. The Github Repo is: https://github.com/amitbcp/icdmai_2020
  2. The setup notebook in the repo is: 0_setup.ipynb
  3. Open the link below and select the 0_setup.ipynb file to open it under your personal Google account
  4. Link: https://colab.research.google.com/github/amitbcp/icdmai_2020/blob/master/
  5. Run the Cells in the notebook after connecting to a run-time.
    i. The first cell requests access to your Google Drive to create the appropriate folders/files.
    ii. Please authorize and allow access to the notebook from your Google account by copying the verification code that appears.
    iii. The second cell installs the required packages. This should set up your workspace.
    iv. Run the cells to download the required data sources.
  6. From here, follow the notebooks in numbered order.
  7. Verify your Drive has a folder ICDMAI_Tutorial/notebook. This should have a couple of notebooks.
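
The check in step 7 can also be done from a notebook cell. The following is a minimal sketch, not part of the repository: the helper name list_tutorial_notebooks is hypothetical, and the mount point /content/drive/MyDrive is an assumption about where Colab mounted your Drive (adjust it if your runtime uses a different path).

```python
from pathlib import Path

# Assumption: Google Drive is mounted here; change this if your mount point differs.
DRIVE_ROOT = Path("/content/drive/MyDrive")


def list_tutorial_notebooks(drive_root=DRIVE_ROOT):
    """Return the sorted notebook file names found in ICDMAI_Tutorial/notebook.

    Returns an empty list when the folder does not exist, which suggests
    the setup notebook has not finished creating it.
    """
    folder = Path(drive_root) / "ICDMAI_Tutorial" / "notebook"
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.ipynb"))


# An empty list means the folder is missing or holds no notebooks yet.
print(list_tutorial_notebooks())
```

If the printed list is empty, re-run the setup cells before moving on to the numbered notebooks.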

Quick Start

Want to play with these notebooks online without having to install anything?

Use any of the following services.

WARNING: Please be aware that these services provide temporary environments: anything you do will be deleted after a while, so make sure you download any data you care about.

Just want to quickly look at some notebooks, without executing any code?

Browse this repository using jupyter.org's notebook viewer:

Note: github.com's notebook viewer also works but it is slower and the math equations are not always displayed correctly.

Problem Statement

Given a post/question from Stack Overflow, predict its Technology Domain and associated Tags. We work with only 14 Technology Domains and 112 Tags. Less than 10% of the Stack Overflow question data is used, for demonstration purposes only.

The Technology Domains are:

  1. Programming
  2. MS-Development Environment
  3. Server Side Development
  4. Mobile App Development
  5. Dev Environment
  6. Front-end/Designing
  7. Dynamic UI
  8. MVC
  9. Dev Ops
  10. Big Data
  11. QA
  12. Project Management
  13. Scripting
  14. Business Analytics
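
To make the two targets concrete: the Technology Domain is a multi-class label (exactly one per question), while the Tags form a multi-label target (any number per question). The sketch below shows one common way to encode them, under stated assumptions: the helper names are hypothetical, the tag names are invented for illustration (the real 112 tags are not listed here), and the notebooks may encode labels differently.

```python
def encode_domain(domain, domains):
    """Multi-class target: a single integer index per question."""
    return domains.index(domain)


def encode_tags(tags, vocabulary):
    """Multi-label target: a multi-hot vector with one slot per known tag."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]


# Hypothetical toy subset; the tutorial itself uses 14 domains and 112 tags.
DOMAINS = ["Programming", "Big Data", "QA"]
TAGS = ["python", "java", "hadoop", "selenium"]

question = {"domain": "Big Data", "tags": ["java", "hadoop"]}
y_domain = encode_domain(question["domain"], DOMAINS)
y_tags = encode_tags(question["tags"], TAGS)
print(y_domain, y_tags)  # 1 [0, 1, 1, 0]
```

With this encoding, the domain output is typically trained with a softmax over the domains, and the tag output with one independent sigmoid per tag.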

Notebooks

Setup Notebook : Notebook to set up and try out experiments.
Checklist & Flow for NLP Problems : Best practices for solving Deep Learning problems related to NLP.

  1. Exploratory Data Analysis : Explore the relationships within different groups and labels with the text.
  2. Classical ML Approach : Demonstrate Naive Bayes, SVM & Logistic Regression for baseline modeling.
  3. Data Preparation : Creating datasets for RNNs
    a. Standard : Creating a group-level dataset without handling bias in the data distribution.
    b. Normalise : Creating a group-level dataset after clipping data to normalise the distribution.
  4. Word Embedding : Training Custom word-embeddings for our corpus
  5. Model Training : Prototyping & Training LSTMs for Text Classification
  6. Inference Pipeline : Ensembling models for prediction
  7. Visualise Results : Plotting Loss curves & performance metrics
  8. Outline : Handling RNN pipelines end-to-end.
  9. Proposal : Proposal to ICDMAI 2020 committee.
  10. Presentation : Presentation used for the session.
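
As an illustration of the "Normalise" data preparation variant (item 3b), clipping can be read as capping the number of examples kept per group so that large groups do not dominate training. The sketch below is one plausible interpretation, not the notebook's actual code: the function name clip_per_group, the (text, group) row layout, and the random-sampling rule are all assumptions.

```python
import random
from collections import defaultdict


def clip_per_group(rows, max_per_group, seed=0):
    """Keep at most max_per_group rows for each group label.

    rows is an iterable of (text, group) pairs. Groups larger than the cap
    are randomly down-sampled; smaller groups are kept unchanged.
    """
    by_group = defaultdict(list)
    for text, group in rows:
        by_group[group].append((text, group))

    rng = random.Random(seed)  # fixed seed keeps the clipping reproducible
    clipped = []
    for group in sorted(by_group):
        items = by_group[group]
        if len(items) > max_per_group:
            items = rng.sample(items, max_per_group)
        clipped.extend(items)
    return clipped


# Toy data: one dominant group and one small group.
rows = [(f"question {i}", "Programming") for i in range(10)]
rows += [(f"question {i}", "QA") for i in range(10, 13)]
balanced = clip_per_group(rows, max_per_group=4)
print(len(balanced))  # 7
```

Here the dominant group is cut to 4 rows while the smaller group keeps all 3, which flattens the group distribution at the cost of discarding some data.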