Text_classification

Overview

This repository implements common algorithms for multi-class text classification. Note that these are prototypes, intended for experimental purposes only.

Word- or char-level representation: chi-square + tfidf, word2vec, glove, fasttext, elmo, bert, or a concatenation of these
Model: CNN, BiLSTM, Self-attention, C-LSTM, RCNN, Capsule, HAN, SVM, XGBoost
Multi-task learning: for data with more than one set of labels
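
As an illustration of the chi-square step in the first representation, here is a minimal pure-Python sketch that scores terms against one target class using the standard 2x2 contingency formula. It is not taken from the repository; the function name `chi_square_scores` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def chi_square_scores(docs, labels, target):
    """Score each term by the chi-square statistic against one target class.

    docs: list of token lists; labels: list of class labels (same length).
    Uses term presence/absence per document (a 2x2 contingency table).
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos
    in_pos, in_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        bucket = in_pos if y == target else in_neg
        bucket.update(set(tokens))  # count each term once per document

    scores = {}
    for term in set(in_pos) | set(in_neg):
        a = in_pos[term]   # target-class docs containing the term
        b = in_neg[term]   # other docs containing the term
        c = n_pos - a      # target-class docs without the term
        d = n_neg - b      # other docs without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Toy data, invented for illustration only.
docs = [
    ["door", "dent"], ["door", "scratch"],
    ["bumper", "dent"], ["bumper", "crack"],
]
labels = ["door", "door", "bumper", "bumper"]
scores = chi_square_scores(docs, labels, target="door")
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

Terms that appear equally often inside and outside the target class (here "dent") score zero, so keeping only the top-scoring terms before tfidf weighting shrinks the vocabulary to the discriminative ones.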

pip install -r requirements.txt

python run_classifier.py

In config.py, set new_data=True to generate ./data/*.tf_record; parameters are then taken from config.py.
In config.py, set new_data=False to reuse the existing data in ./data/*.tf_record; parameters are then taken from config.json.
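
The switch between the two parameter sources can be pictured roughly as follows. This is only a sketch of the behaviour described above, not the repository's code: `resolve_params`, the `model` and `batch_size` keys, and their values are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def resolve_params(new_data, py_params, json_path):
    """Pick the parameter source based on the new_data flag (illustrative only)."""
    if new_data:
        # Fresh run: the records are regenerated and parameters come from config.py.
        return dict(py_params)
    # Reuse run: existing records are read and parameters come from config.json.
    return json.loads(Path(json_path).read_text(encoding="utf-8"))

# Hypothetical values, for illustration only.
py_params = {"model": "CNN", "batch_size": 64}
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "config.json"
    json_path.write_text(
        json.dumps({"model": "BiLSTM", "batch_size": 32}), encoding="utf-8"
    )
    fresh = resolve_params(True, py_params, json_path)
    reused = resolve_params(False, py_params, json_path)
```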

word2vec Chinese pretrained download
fasttext Chinese pretrained download
bert Chinese pretrained download from google
Tip: make sure your text goes through the same preprocessing (e.g., word segmentation) as the corpus the pretrained model was built from
Reference for creating a pretrained word2vec model

The classifier is used to identify the damaged part and the damage type from comments about vehicles.

Check in tensorboard: tensorboard --logdir=./outputs
Because we have a large number of label categories (ca. 500 classes for 100,000 examples) and they are not equally important, we don't use macro-averaged metrics. Micro-averaged precision, recall, and F1 are all identical (and equal to accuracy) when each sample has exactly one label, so they add no information. We therefore check accuracy and weighted F1.
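
The relationship between these averages can be checked with a small pure-Python sketch. It assumes single-label predictions; the function name `f1_report` and the toy labels are made up for this example, and in practice a library such as scikit-learn would normally be used instead.

```python
from collections import Counter

def f1_report(y_true, y_pred):
    """Accuracy plus micro, macro, and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    return {
        "accuracy": sum(tp.values()) / n,
        "micro_f1": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "macro_f1": sum(per_class.values()) / len(labels),
        "weighted_f1": sum(per_class[c] * support[c] for c in labels) / n,
    }

# Toy, imbalanced example: class "a" dominates, class "c" is never predicted.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "a"]
report = f1_report(y_true, y_pred)
```

In this toy case micro F1 equals accuracy, while macro F1 is pulled down by the rare class "c"; weighted F1 sits in between because each class counts in proportion to its support.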

Sometimes more than one label is valid for a single sample
Some labels have a hierarchical relationship
Imbalance issue: weighted loss, data augmentation, anomaly detection, upsampling and downsampling
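
For the weighted-loss option, one common way to derive per-class weights is inverse class frequency. The sketch below assumes the "balanced" heuristic weight_c = n_samples / (n_classes * count_c); whether the repository uses this exact formula is not stated, and the function name and toy labels are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Toy label distribution, invented for illustration only.
labels = ["scratch"] * 6 + ["dent"] * 3 + ["crack"] * 1
weights = balanced_class_weights(labels)

# Per-sample weights that could multiply each sample's loss term.
sample_weights = [weights[y] for y in labels]
```

With this scheme the per-sample weights sum to the number of samples, so the overall loss scale stays comparable while rare classes contribute more per example.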
