LongxingTan/Text-classification

Text_classification

Overview

This repository implements common algorithms for multi-class text classification. Note that these are prototypes, intended for experimental purposes only.

  • Word- or char-level representation: chi-square + tf-idf, word2vec, GloVe, fastText, ELMo, BERT, or a concatenation of these
  • Model: CNN, BiLSTM, Self-attention, C-LSTM, RCNN, Capsule, HAN, SVM, XGBoost
  • Multi-task learning: for data with more than one set of labels (multi_labels)

Dependencies

pip install -r requirements.txt

  • Python 3.6
  • TensorFlow 1.12.0

Usage

python run_classifier.py

  • In config.py, set new_data=True to generate ./data/*.tf_record; the run then uses the parameters from config.py
  • In config.py, set new_data=False to reuse the existing data in ./data/*.tf_record; the run then uses the parameters from config.json
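The new_data switch can be pictured with the minimal sketch below. It is not the repository's code: the function name `resolve_params`, the parameter names, and the idea that config.json is written when the data is generated are all assumptions made for illustration.

```python
import json
import os
import tempfile


def resolve_params(params, json_path):
    """Return the parameters a run should use (hypothetical helper).

    Assumption: parameters are saved to JSON when the data is generated,
    so later runs on the same data can reload them unchanged.
    """
    if params["new_data"]:
        # Fresh data: this is where the TFRecord files would be regenerated.
        # The in-code parameters are used and saved for later runs.
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(params, f, indent=2)
        return dict(params)
    # Existing data: reuse the parameters saved alongside it.
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        first = resolve_params({"new_data": True, "max_len": 64}, path)
        later = resolve_params({"new_data": False, "max_len": 128}, path)
        # The second run ignores its in-code max_len and reuses the saved one.
        print(first["max_len"], later["max_len"])
```

Under this assumption, reusing the saved parameters keeps the model settings consistent with the data files that were produced from them.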

Pretrained

Purpose

  • The classifier is used to identify the damaged part and the damage type from comments about vehicles

Evaluation

  • Check the results in TensorBoard: tensorboard --logdir=./outputs
  • Because there are many label categories (ca. 500 classes for 100,000 examples) and they are not equally important, macro-averaged metrics are not used. Micro-averaged precision, recall, and F1 are all identical to accuracy in single-label multi-class classification, so they add no information. Accuracy and weighted F1 are checked instead.
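The difference between the averaging modes can be seen in a small pure-Python sketch. The label names and predictions below are made up for illustration, and the helper functions are not part of the repository; the weighted average uses each class's share of the true labels as its weight.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_report(y_true, y_pred):
    """Accuracy plus micro-, macro-, and weighted-averaged F1 (single-label)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    n = len(y_true)
    return {
        "accuracy": sum(tp.values()) / n,
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro": sum(per_class.values()) / len(labels),
        "weighted": sum(per_class[c] * support[c] for c in labels) / n,
    }


if __name__ == "__main__":
    # Made-up, imbalanced example: "roof" is rare and always misclassified.
    y_true = ["door"] * 6 + ["hood"] * 3 + ["roof"]
    y_pred = ["door"] * 6 + ["hood", "hood", "door"] + ["door"]
    for name, value in f1_report(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

In this example micro F1 equals accuracy, macro F1 is pulled down sharply by the single rare class, and weighted F1 sits in between because each class counts in proportion to its frequency.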

Ignored properties

  • Sometimes more than one label is valid for a single sample
  • Some labels have hierarchical relationships
  • Class imbalance, with possible remedies: weighted loss, data augmentation, anomaly detection, upsampling and downsampling
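As an illustration of the weighted-loss idea named in the last item, the sketch below computes inverse-frequency class weights and a weighted cross-entropy in plain Python. The weighting formula n_samples / (n_classes * count) is one common heuristic, chosen here as an assumption; the function names and the example labels are hypothetical and not taken from the repository.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs[i] is a dict mapping each class to its predicted probability.
    The sum is normalized by the total weight, so rare classes contribute
    as much to the loss as frequent ones.
    """
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


if __name__ == "__main__":
    # Made-up labels: 8 "scratch" samples and 2 "dent" samples.
    labels = ["scratch"] * 8 + ["dent"] * 2
    weights = balanced_class_weights(labels)
    print(weights)

    # A model that always favors the frequent class.
    probs = [{"scratch": 0.9, "dent": 0.1}] * len(labels)
    uniform = {c: 1.0 for c in weights}
    print("unweighted:", weighted_cross_entropy(probs, labels, uniform))
    print("weighted:  ", weighted_cross_entropy(probs, labels, weights))
```

With these weights the rare class is penalized more heavily, so a model that always predicts the frequent class gets a noticeably higher loss than it would under the unweighted version.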

Todo

  • Multi-task learning
  • Multi-label classification