# Text Classification


# Contains:
  1. Implementations of the PV-DM and PV-DBoW models in TensorFlow (including inference for new paragraphs/sentences). My goal is to learn how to generate good vectors for sentences and paragraphs. Probably a good starting point for beginners who want to represent texts with a neural model.
  2. text_cnn and text_rnn model implementations, adapted from brightmart/text_classification with my personal comments.
  3. fasttext tutorial #todo
# todo
  • evaluations of text representation

    • task0: the second benchmark experiment in Mikolov et al., 2014, using the IMDB dataset.
      • IMDB (Large Movie Review Dataset) from "Learning Word Vectors for Sentiment Analysis", Maas et al., ACL-HLT 2011.
        • textrnn and textcnn both perform well on this dataset.
    • task1: positive/negative sentiment classification of titles from a digital forum.
      • in my experiment, textrnn with 300 LSTM cells performs better than textcnn with 128 filters of sizes 2 to 5. This may be because my forum texts are short, while textcnn usually performs better on long-paragraph classification.
    • task3: 300,000 documents and 100 categories, from Minmin Chen, 2017, "Efficient Vector Representation for Documents through Corruption" (Doc2vecC).
      • the author implements this in C++ based on Mikolov's word2vec.

# My insights (for beginners ;)
  • read in the front

    • word2vec/doc2vec generate text representations in an unsupervised way. More precisely, with word2vec you represent a sentence by averaging the vectors of its words; doc2vec instead learns a vector for each sentence/paragraph directly during training, using a shallow word2vec-style network plus a document embedding. (See the averaging sketch after this list.)

    • fasttext is a method for learning word representations with supervised information, such as sentence labels. You get word vectors tailored to a given task, and averaging the vectors of a sentence's words gives you the sentence's representation/features. Finally, you use these features as input to train a classifier/regressor for your task.

    • textcnn/textrnn train a slightly deeper model than word2vec/doc2vec.

      • Kim, 2014 is a classic convolutional network for text classification. You should read it first.

      • textcnn/textrnn generate sentence representations too (during training).

        • for example, in Kim's paper, after one convolution layer (1D-Conv) and a max-pooling layer, you get the representation of a sentence (a vector of shape [filter_num, 1]).
        • in textrnn, the sentence representation can be the last hidden state of the LSTM, or an average of the outputs of the LSTM cells over all time steps. (Both ideas are sketched below.)
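
A minimal sketch of the averaging idea mentioned above, assuming you already have trained word vectors in a plain Python dict (token -> numpy array); the names `word_vectors` and `sentence_vector` are only illustrative, not part of this repo.

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    """Average the vectors of a tokenized sentence's words.

    word_vectors: dict mapping token -> np.ndarray of shape (dim,),
    e.g. exported from a trained word2vec/fasttext model.
    Out-of-vocabulary tokens are simply skipped.
    """
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)          # empty / all-OOV sentence
    return np.mean(vecs, axis=0)
```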
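
And a rough TensorFlow 1.x sketch of where the sentence representation comes from in textcnn/textrnn; the shapes (seq_len=50, embed_dim=128) and the filter/cell sizes are made up for illustration, not the exact ones used in this repo.

```python
import tensorflow as tf

# embedded input: [batch_size, seq_len, embed_dim]
x = tf.placeholder(tf.float32, [None, 50, 128])

# textcnn-style: one 1D convolution + max-over-time pooling
# gives a sentence vector of shape [batch_size, filter_num].
conv = tf.layers.conv1d(x, filters=100, kernel_size=3, activation=tf.nn.relu)
cnn_sentence_vec = tf.reduce_max(conv, axis=1)

# textrnn-style: the last LSTM hidden state, or the mean of the
# outputs over all time steps, serves as the sentence vector.
cell = tf.nn.rnn_cell.LSTMCell(300)
outputs, state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)
rnn_sentence_vec_last = state.h                       # [batch_size, 300]
rnn_sentence_vec_mean = tf.reduce_mean(outputs, axis=1)
```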

  • Control the size of your vocabulary

    • for Chinese, a common vocabulary size is around 150,000 (15w) words.
      • the initial vocabulary of your corpus will often exceed 500,000 (50w) words; pick a suitable min_count, then remove stopwords and other meaningless words. That gives you your vocabulary (see the sketch after this list).
    • if you don't control the vocabulary size, training may run out of memory (OOM). These low-frequency, meaningless words also interfere with training your NLP tasks.
    • fix erroneous phrases/words in your corpus as much as possible. ;)
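
A small sketch of the min_count + stopword filtering described above, assuming a line-per-document Chinese corpus file and jieba for segmentation; `corpus.txt`, the stopword set, and `min_count = 5` are placeholder values to tune for your own data.

```python
from collections import Counter
import jieba

stopwords = {"的", "了", "是"}      # placeholder stopword list
min_count = 5                        # raise/lower until the vocab is ~15w

counter = Counter()
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(tok for tok in jieba.cut(line.strip())
                       if tok.strip() and tok not in stopwords)

vocab = [w for w, c in counter.most_common() if c >= min_count]
print("vocabulary size:", len(vocab))
```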

  • unbalanced classes in your NLP tasks

    • using class weights # ref

      • there are several ways to use class_weights in TensorFlow (a sketch follows this list)

        • use a sample_weight for each batch (pass the sample weights into tf.losses.sparse_softmax_cross_entropy). // this is the way I use.
        • use class_weights directly to weight the logits (some people think this is wrong, # stackoverflow).
        • use tf.nn.weighted_cross_entropy_with_logits() to handle unbalanced binary classes.
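
A minimal TensorFlow 1.x sketch of the first option above (per-sample weights passed to tf.losses.sparse_softmax_cross_entropy); the three-class setup and the weight values are invented for illustration.

```python
import tensorflow as tf

# Hypothetical 3-class task where class 0 is rare and gets a larger weight.
class_weights = tf.constant([5.0, 1.0, 1.0])

labels = tf.placeholder(tf.int32, [None])         # [batch_size]
logits = tf.placeholder(tf.float32, [None, 3])    # [batch_size, num_classes]

# Look up one weight per example from its class label, and pass it as the
# per-sample `weights` argument of the loss.
sample_weights = tf.gather(class_weights, labels)
loss = tf.losses.sparse_softmax_cross_entropy(
    labels=labels, logits=logits, weights=sample_weights)
```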

  • most GPU tutorials on these shallow neural networks (including mine) are slower than a CPU implementation (e.g. gensim), unless the CUDA side is specifically tuned.


# Reference

# Learning resources

# Requirements

( I recommend Google Colab with its free 12-hour GPU, a Tesla K80; tutorial here )

( my .ipynb online: https://drive.google.com/file/d/1WPWM103comn-1kyGv5FXGO-ZC2Lt7ChH/view?usp=sharing )

  • 8GB memory, with Nvidia GeForce MX150, compute capability: 6.1
  • python3.5
  • tensorflow-gpu 1.4.0
  • numpy 1.13.1
  • jieba 0.39
  • t-SNE
