- implementation of the PV-DM and PV-DBoW models in TensorFlow (including inferring vectors for new paragraphs/sentences). My goal is to learn how to generate good vectors for sentences and paragraphs. Probably a good starting point for beginners who want to represent texts with neural models.
- text_cnn and text_rnn model implementations, adapted from brightmart/text_classification with my personal comments.
- fasttext tutorial #todo
-
evaluations of text representation
- task0: the second experiment used as a benchmark in Mikolov et al., 2014, on the IMDB dataset.
- IMDB (Large Movie Review Dataset) from "Learning Word Vectors for Sentiment Analysis", Maas et al., ACL-HLT 2011
- textrnn and textcnn both perform well on this dataset.
- task1: positive/negative sentiment classification of titles from a digital forum.
- in my experiment, textrnn with 300 LSTM cells performs better than textcnn with 128 filters of sizes 2 to 5. This may be because my forum texts are short, while textcnn performs better on long-paragraph classification most of the time.
- task3: 300,000 documents and 100 categories, from Minmin Chen, 2017, "Efficient Vector Representation for Documents through Corruption" (Doc2vecC)
- the author implements this in C++, based on Mikolov's word2vec.
-
read this first
-
word2vec/doc2vec generate text representations with unsupervised methods. More precisely, with word2vec you average the word vectors of a sentence to represent it; doc2vec learns a vector for each sentence/paragraph directly during training, using a shallow word2vec-style network plus a document embedding.
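As a concrete example, here is a minimal sketch of the word-vector averaging approach, assuming pretrained vectors in word2vec format loaded with gensim (the file name vectors.bin is a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical pretrained vectors; any word2vec-format file works
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def sentence_vector(tokens):
    """Average the word vectors of the in-vocabulary tokens."""
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

sent = sentence_vector(["good", "movie"])  # shape: (vector_size,)
```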
-
fasttext is a method that learns word representations with supervised information, such as sentence labels. You get word vectors tailored to a given task, and averaging these vectors over a sentence gives you a representation/feature vector for the sentence. Finally, you use these features as input to train a classifier/regressor for your task.
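A minimal sketch with the fasttext pip package (0.9+, an assumption; train.txt is a placeholder file in fastText format, one "__label__xxx text" example per line):

```python
import fasttext

# supervised training: word vectors are learned for the labeling task
model = fasttext.train_supervised(input="train.txt", dim=100, epoch=5)

# sentence features = average of its (normalized) word vectors
sent_vec = model.get_sentence_vector("this movie is great")

# or classify directly with the built-in linear classifier
labels, probs = model.predict("this movie is great")
```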
-
textcnn/textrnn train a slightly deeper model than word2vec/doc2vec.
-
Kim et al., 2014 is the classic text classification convolutional network; you should read it first.
-
textcnn/textrnn generate representations of sentences too (during training).
- for example, in Kim's paper, after one convolution layer (1D-Conv) and a max-pooling layer, you get the representation of a sentence (a vector of shape [filter_num, 1]); see the sketch after this list.
- in textrnn, the representation of a sentence can be the last hidden state of the LSTM, or an average of the outputs of all the LSTM cells' hidden states.
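A minimal sketch of Kim-style sentence representation in TensorFlow 1.x (matching the tensorflow-gpu 1.4.0 environment below); all sizes are hypothetical:

```python
import tensorflow as tf

seq_len, embed_dim, filter_num, filter_size = 50, 128, 100, 3

# embedded sentence batch: [batch, seq_len, embed_dim, 1]
x = tf.placeholder(tf.float32, [None, seq_len, embed_dim, 1])

# 1D convolution over windows of filter_size words
w = tf.get_variable("w", [filter_size, embed_dim, 1, filter_num])
b = tf.get_variable("b", [filter_num])
conv = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="VALID") + b)

# max-over-time pooling collapses the time axis: each example becomes
# a filter_num-dimensional vector -- the sentence representation
pooled = tf.nn.max_pool(conv, ksize=[1, seq_len - filter_size + 1, 1, 1],
                        strides=[1, 1, 1, 1], padding="VALID")
sent_vec = tf.reshape(pooled, [-1, filter_num])

# for textrnn, the analogous vector would be the final state of
# tf.nn.dynamic_rnn over an LSTM cell, or the mean of all its outputs
```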
-
Control the size of your vocabulary
- for Chinese, a common vocabulary size is about 150,000 (15w).
- the initial vocabulary of your corpus might exceed 500,000 (50w) words; choose a suitable min_count for these words and remove stopwords and other meaningless words. Then you get your vocabulary (see the sketch after this list).
- if you don't control the vocabulary size, training may run out of memory (OOM). These low-frequency, meaningless words also interfere with training your NLP tasks.
- fix erroneous phrases/words in your corpus as much as possible.
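A minimal sketch of this pruning step in plain Python (the toy corpus and thresholds are placeholders; real Chinese text would be segmented first, e.g. with jieba from the environment list below):

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5, stopwords=()):
    """Keep only words that occur at least min_count times and are not stopwords."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    return {w for w, c in counts.items() if c >= min_count and w not in stopwords}

docs = [["this", "movie", "is", "great"], ["great", "movie"]]
print(build_vocab(docs, min_count=2))  # {'movie', 'great'} on this toy corpus
```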
-
unbalanced classes in your nlp tasks
-
using class weights # ref
-
there are several ways to use class_weights in tensorflow
- use a sample_weight for each batch (pass sample weights into tf.losses.sparse_softmax_cross_entropy). // I use this approach; see the sketch after this list.
- use class_weights directly to weight the logits. (some people think this is wrong, # stackoverflow)
- use tf.nn.weighted_cross_entropy_with_logits() to handle unbalanced binary classes.
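A minimal sketch of the first option (the per-sample weight approach I use), with hypothetical class weights; the weight for each example is looked up from its label and passed as the weights argument:

```python
import tensorflow as tf  # TF 1.x API, matching tensorflow-gpu 1.4.0 below

num_classes = 2
labels = tf.placeholder(tf.int32, [None])                 # class id per example
logits = tf.placeholder(tf.float32, [None, num_classes])  # model outputs

# hypothetical per-class weights, e.g. inverse class frequencies
class_weights = tf.constant([0.3, 0.7])

# one weight per example in the batch, used as the sample weight
sample_weights = tf.gather(class_weights, labels)
loss = tf.losses.sparse_softmax_cross_entropy(
    labels=labels, logits=logits, weights=sample_weights)
```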
-
most GPU tutorials on these shallow neural networks (including mine) are slower than a CPU implementation (e.g. gensim), unless the CUDA code is specifically tuned
- using the original word2vec (Mikolov) is a better choice. # word2vec/dav # Google's official word2vec with Chinese annotations/tankle
- use gensim with the correct parameters (see the sketch after this list).
- the same applies to doc2vec.
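A minimal sketch of doc2vec in gensim (assuming gensim 3.x; the toy corpus and parameters are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["good", "movie"], tags=[0]),
        TaggedDocument(words=["bad", "plot"], tags=[1])]

# dm=1 -> PV-DM, dm=0 -> PV-DBoW; workers drives gensim's CPU speed advantage
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1,
                workers=4, epochs=20, dm=1)

# infer a vector for an unseen sentence, as in the PV-DM/PV-DBoW repo above
vec = model.infer_vector(["great", "movie"])
```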
-
text representations
- Distributed Representations of Words and Phrases and their Compositionality_Mikolov_2013
- Distributed Representations of Sentences and Documents_Tomas_Mikolov_2014
- Bag of Tricks for Efficient Text Classification_facebook_Mikolov_2016
- Skip-Thought Vectors_Kiros_2015
- An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation_IBM_Research_2016
- Enriching Word Vectors with Subword Information_Bojanowski_2016
- Efficient Vector Representation for Documents through Corruption_Minmin_Chen_2017
-
textcnn
- Convolutional Neural Networks for Sentence Classification_Kim_2014
-
textrnn
-
rcnn
- Recurrent Convolutional Neural Networks for Text Classification_AAAI_2015 #ref #Institute of Automation, Chinese Academy of Sciences
- Learning text representation using recurrent convolutional neural network with highway layers_2016
-
Doc2VecC from the paper "Efficient Vector Representation for Documents through Corruption"/mchen24
-
Tutorial for Sentiment Analysis using Doc2Vec in gensim/linanqiu
-
"Distributed Representations of Sentences and Documents" Code?/google forum
( I recommend Google Colab with a free 12-hour GPU, Tesla K80; tutorial here )
( my .ipynb online: https://drive.google.com/file/d/1WPWM103comn-1kyGv5FXGO-ZC2Lt7ChH/view?usp=sharing )
- 8GB memory, with Nvidia GeForce MX150, compute capability: 6.1
- python3.5
- tensorflow-gpu 1.4.0
- numpy 1.13.1
- jieba 0.39
- t-SNE
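t-SNE in the list above is presumably for visualizing the learned vectors in 2D; a minimal sketch with scikit-learn's TSNE (scikit-learn itself is an assumption, it is not in the dependency list):

```python
import numpy as np
from sklearn.manifold import TSNE

doc_vecs = np.random.rand(100, 100)  # placeholder for learned document vectors

# project to 2D for plotting; perplexity must be smaller than the sample count
coords = TSNE(n_components=2, perplexity=30).fit_transform(doc_vecs)
print(coords.shape)  # (100, 2)
```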