- implementation of the PV-DM and PV-DBoW models in TensorFlow (including inferring vectors for new paragraphs/sentences). My goal is to learn how to generate good vectors for sentences and paragraphs. Probably a good starting point for beginners who want to represent texts with neural models.
- text_cnn and text_rnn model implementations, adapted from brightmart/text_classification with my personal comments.
- fasttext tutorial #todo
-
evaluations of text representation
- task0: the second experiment used as a benchmark in Mikolov et al., 2014, on the IMDB dataset.
- IMDB (Large Movie Review Dataset) from "Learning Word Vectors for Sentiment Analysis", Maas et al., ACL-HLT 2011
- textrnn and textcnn both perform well on this dataset.
- task1: positive/negative sentiment classification of titles from a digital forum.
- in my experiment, textrnn with 300 LSTM cells performs better than textcnn with 128 filters of sizes 2 to 5. This may be because my forum texts are short, while textcnn performs better on long-paragraph classification most of the time.
- task3: 300,000 documents and 100 categories, from Minmin Chen, 2017, "Efficient Vector Representation for Documents through Corruption" (Doc2vecC)
- the author implements this in C++, based on Mikolov's word2vec.
-
read this first
-
word2vec/doc2vec generate text representations with unsupervised methods. More precisely, with word2vec you average the word vectors of a sentence to represent it; doc2vec learns a vector for each sentence/paragraph directly during training, using a shallow word2vec-style network plus a document embedding.
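As a concrete example, here is a minimal sketch of the word-vector averaging approach, assuming pretrained vectors in word2vec format loaded with gensim (the file name vectors.bin is a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical pretrained vectors; any word2vec-format file works
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def sentence_vector(tokens):
    """Average the word vectors of the in-vocabulary tokens."""
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

sent = sentence_vector(["good", "movie"])  # shape: (vector_size,)
```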
-
fasttext is a method that learns word representations with supervised information, such as sentence labels. You get word vectors tailored to a given task, and averaging these vectors over a sentence gives you a representation/feature vector for the sentence. Finally, you use these features as input to train a classifier/regressor for your task.
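A minimal sketch with the fasttext pip package (0.9+, an assumption; train.txt is a placeholder file in fastText format, one "__label__xxx text" example per line):

```python
import fasttext

# supervised training: word vectors are learned for the labeling task
model = fasttext.train_supervised(input="train.txt", dim=100, epoch=5)

# sentence features = average of its (normalized) word vectors
sent_vec = model.get_sentence_vector("this movie is great")

# or classify directly with the built-in linear classifier
labels, probs = model.predict("this movie is great")
```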
-
textcnn/textrnn train a slightly deeper model than word2vec/doc2vec.
-
Kim et al., 2014 is the classic text classification convolutional network; you should read it first.
-
textcnn/textrnn generate representations of sentences too (during training).
- for example, in Kim's paper, after one convolution layer (1D-Conv) and a max-pooling layer, you get the representation of a sentence (a vector of shape [filter_num, 1]); see the sketch after this list.
- in textrnn, the representation of a sentence can be the last hidden state of the LSTM, or an average of the outputs of all the LSTM cells' hidden states.
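A minimal sketch of Kim-style sentence representation in TensorFlow 1.x (matching the tensorflow-gpu 1.4.0 environment below); all sizes are hypothetical:

```python
import tensorflow as tf

seq_len, embed_dim, filter_num, filter_size = 50, 128, 100, 3

# embedded sentence batch: [batch, seq_len, embed_dim, 1]
x = tf.placeholder(tf.float32, [None, seq_len, embed_dim, 1])

# 1D convolution over windows of filter_size words
w = tf.get_variable("w", [filter_size, embed_dim, 1, filter_num])
b = tf.get_variable("b", [filter_num])
conv = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="VALID") + b)

# max-over-time pooling collapses the time axis: each example becomes
# a filter_num-dimensional vector -- the sentence representation
pooled = tf.nn.max_pool(conv, ksize=[1, seq_len - filter_size + 1, 1, 1],
                        strides=[1, 1, 1, 1], padding="VALID")
sent_vec = tf.reshape(pooled, [-1, filter_num])

# for textrnn, the analogous vector would be the final state of
# tf.nn.dynamic_rnn over an LSTM cell, or the mean of all its outputs
```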
-
Control the size of your vocabulary
- for Chinese, a common vocabulary size is about 150,000 (15w).
- the initial vocabulary of your corpus might exceed 500,000 (50w) words; choose a suitable min_count for these words and remove stopwords and other meaningless words. Then you get your vocabulary (see the sketch after this list).
- if you don't control the vocabulary size, training may run out of memory (OOM). These low-frequency, meaningless words also interfere with training your NLP tasks.
- fix erroneous phrases/words in your corpus as much as possible.
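A minimal sketch of this pruning step in plain Python (the toy corpus and thresholds are placeholders; real Chinese text would be segmented first, e.g. with jieba from the environment list below):

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5, stopwords=()):
    """Keep only words that occur at least min_count times and are not stopwords."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    return {w for w, c in counts.items() if c >= min_count and w not in stopwords}

docs = [["this", "movie", "is", "great"], ["great", "movie"]]
print(build_vocab(docs, min_count=2))  # {'movie', 'great'} on this toy corpus
```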
-
unbalanced classes in your nlp tasks
-
using class weights # ref
-
there are several ways to use class_weights in tensorflow
- use a sample_weight for each batch (pass sample weights into tf.losses.sparse_softmax_cross_entropy). // I use this approach; see the sketch after this list.
- use class_weights directly to weight the logits. (some people think this is wrong, # stackoverflow)
- use tf.nn.weighted_cross_entropy_with_logits() to handle unbalanced binary classes.
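A minimal sketch of the first option (the per-sample weight approach I use), with hypothetical class weights; the weight for each example is looked up from its label and passed as the weights argument:

```python
import tensorflow as tf  # TF 1.x API, matching tensorflow-gpu 1.4.0 below

num_classes = 2
labels = tf.placeholder(tf.int32, [None])                 # class id per example
logits = tf.placeholder(tf.float32, [None, num_classes])  # model outputs

# hypothetical per-class weights, e.g. inverse class frequencies
class_weights = tf.constant([0.3, 0.7])

# one weight per example in the batch, used as the sample weight
sample_weights = tf.gather(class_weights, labels)
loss = tf.losses.sparse_softmax_cross_entropy(
    labels=labels, logits=logits, weights=sample_weights)
```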
-
most GPU tutorials on these shallow neural networks (including mine) are slower than a CPU implementation (e.g. gensim), unless the CUDA code is specifically tuned
- using the original word2vec (Mikolov) is a better choice. # word2vec/dav # Google's official word2vec with Chinese annotations/tankle
- use gensim with the correct parameters (see the sketch after this list).
- the same applies to doc2vec.
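A minimal sketch of doc2vec in gensim (assuming gensim 3.x; the toy corpus and parameters are placeholders):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["good", "movie"], tags=[0]),
        TaggedDocument(words=["bad", "plot"], tags=[1])]

# dm=1 -> PV-DM, dm=0 -> PV-DBoW; workers drives gensim's CPU speed advantage
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1,
                workers=4, epochs=20, dm=1)

# infer a vector for an unseen sentence, as in the PV-DM/PV-DBoW repo above
vec = model.infer_vector(["great", "movie"])
```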
-
text representations
- Distributed Representations of Words and Phrases and their Compositionality_Mikolov_2013
- Distributed Representations of Sentences and Documents_Tomas_Mikolov_2014
- Bag of Tricks for Efficient Text Classification_facebook_Mikolov_2016
- Skip-Thought Vectors_Kiros_2015
- An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation_IBM_Research_2016
- Enriching Word Vectors with Subword Information_Bojanowski_2016
- Efficient Vector Representation for Documents through Corruption_Minmin_Chen_2017
-
textcnn
- Convolutional Neural Networks for Sentence Classification_Kim_2014
-
textrnn
-
rcnn
- Recurrent Convolutional Neural Networks for Text Classification_AAAI_2015 #ref #Institute of Automation, Chinese Academy of Sciences
- Learning text representation using recurrent convolutional neural network with highway layers_2016
-
Doc2VecC from the paper "Efficient Vector Representation for Documents through Corruption"/mchen24
-
Tutorial for Sentiment Analysis using Doc2Vec in gensim/linanqiu
-
"Distributed Representations of Sentences and Documents" Code?/google forum
( I recommend Google Colab with a free 12-hour GPU, Tesla K80; tutorial here )
( my .ipynb online: https://drive.google.com/file/d/1WPWM103comn-1kyGv5FXGO-ZC2Lt7ChH/view?usp=sharing )
- 8GB memory, with Nvidia GeForce MX150, compute capability: 6.1
- python3.5
- tensorflow-gpu 1.4.0
- numpy 1.13.1
- jieba 0.39
- t-SNE
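t-SNE in the list above is presumably for visualizing the learned vectors in 2D; a minimal sketch with scikit-learn's TSNE (scikit-learn itself is an assumption, it is not in the dependency list):

```python
import numpy as np
from sklearn.manifold import TSNE

doc_vecs = np.random.rand(100, 100)  # placeholder for learned document vectors

# project to 2D for plotting; perplexity must be smaller than the sample count
coords = TSNE(n_components=2, perplexity=30).fit_transform(doc_vecs)
print(coords.shape)  # (100, 2)
```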