Multi-classes task classification and LDA-based topic Recommender System

Here is my winning strategy to carry multi-text classification task out.

Data Source : https://catalog.data.gov/dataset/consumer-complaint-database

1 - Text Mining

Word Frequency Plot: Compare frequencies across different texts and quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between text1 and text2, and between text1 and text3?

Most discriminant and important word per categories
Relationships between words & Pairwise correlations: examining which words tend to follow others immediately, or that tend to co-occur within the same documents.

Which word is associated with another word? Note that this is a visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word. In this case, a random generator following this model might spit out “collect”, then “agency”, then “report/credit/score”, by following each word to the most common words that follow it. To make the visualization interpretable, we chose to show only the most common word to word connections, but one could imagine an enormous graph representing all connections that occur in the text.

Distribution of words: Want to show that there are similar distributions for all texts, with many words that occur rarely and fewer words that occur frequently. Here is the goal of Zip Law (extended with Harmonic mean) - Zipf’s Law is a statistical distribution in certain data sets, such as words in a linguistic corpus, in which the frequencies of certain words are inversely proportional to their ranks.

How to spell variants of a given word
Chi-Square to see which words are associated to each category: find the terms that are the most correlated with each of the categories
Part of Speech Tags and Frequency distribution of POST: Noun Count, Verb Count, Adjective Count, Adverb Count and Pronoun Count
Metrics of words: Word Count of the documents – ie. total number of words in the documents, Character Count of the documents – total number of characters in the documents, Average Word Density of the documents – average length of the words used in the documents, Puncutation Count in the Complete Essay – total number of punctuation marks in the documents, Upper Case Count in the Complete Essay – total number of upper count words in the documents, Title Word Count in the Complete Essay – total number of proper case (title) words in the documents

2 - Word Embedding

A - Frequency Based Embedding

Count Vector
TF IDF
Co-Occurrence Matrix with a fixed context window (SVD)
TF-ICF
Function Aware Components

B - Prediction Based Embedding

CBOW (word2vec)
Skip-Grams (word2vec)
Glove
At character level -> FastText
Topic Model as features // LDA features

LDA

Visualization provides a global view of the topics (and how they differ from each other), while at the same time allowing for a deep inspection of the terms most highly associated with each individual topic. A novel method for choosing which terms to present to a user to aid in the task of topic interpretation, in which we define the relevance of a term to a topic.

C - Poincaré Embedding [Embeddings and Hyperbolic Geometry]

The main innovation here is that these embeddings are learnt in hyperbolic space, as opposed to the commonly used Euclidean space. The reason behind this is that hyperbolic space is more suitable for capturing any hierarchical information inherently present in the graph. Embedding nodes into a Euclidean space while preserving the distance between the nodes usually requires a very high number of dimensions.

https://arxiv.org/pdf/1705.08039.pdf https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb

Learning representations of symbolic data such as text, graphs and multi-relational data has become a central paradigm in machine learning and artificial intelligence. For instance, word embeddings such as WORD2VEC, GLOVE and FASTTEXT are widely used for tasks ranging from machine translation to sentiment analysis.

Typically, the objective of embedding methods is to organize symbolic objects (e.g., words, entities, concepts) in a way such that their similarity in the embedding space reflects their semantic or functional similarity. For this purpose, the similarity of objects is usually measured either by their distance or by their inner product in the embedding space. For instance, Mikolov embed words in R^d such that their inner product is maximized when words co-occur within similar contexts in text corpora. This is motivated by the distributional hypothesis, i.e., that the meaning of words can be derived from the contexts in which they appear.

3 - Algorithms

A - Traditional Methods

CountVectorizer + Logistic
CountVectorizer + NB
CountVectorizer + LightGBM
HasingTF + IDF + Logistic Regression
TFIDF + NB
TFIDF + LightGBM
TF-IDF + SVM
Hashing Vectorizer + Logistic
Hashing Vectorizer + NB
Hashing Vectorizer + LightGBM
Bagging / Boosting
Word2Vec + Logistic
Word2Vec + LightGNM
Word2Vec + XGBoost
LSA + SVM

B - Deep Learning Methods

GRU + Attention Mechanism
CNN + RNN + Attention Mechanism
CNN + LSTM/GRU + Attention Mechanism

4 - Explainability

Goal: explain predictions of arbitrary classifiers, including text classifiers (when it is hard to get exact mapping between model coefficients and text features, e.g. if there is dimension reduction involved)

Lime
Skate
Shap

5 - MyApp of multi-classes text classification with Attention mechanism

6 - Ressources / Bibliography

All models : https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/
CNN Text Classification: https://github.com/cmasch/cnn-text-classification/blob/master/Evaluation.ipynb
CNN Multichannel Text Classification + Hierarchical attention + …: https://github.com/gaurav104/TextClassification/blob/master/CNN%20Multichannel%20Text%20Classification.ipynb
Notes for Deep Learning https://arxiv.org/pdf/1808.09772.pdf
Doc classification with NLP https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb
Paragraph Topic Classification http://cs229.stanford.edu/proj2016/report/NhoNg-ParagraphTopicClassification-report.pdf
1D convolutional neural networks for NLP https://github.com/Tixierae/deep_learning_NLP/blob/master/cnn_imdb.ipynb
Hierarchical Attention for text classification https://github.com/Tixierae/deep_learning_NLP/blob/master/HAN/HAN_final.ipynb
Multi-class classification scikit learn (Random forest, SVM, logistic regression) https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb
Text feature extraction TFIDF mathematics https://dzone.com/articles/machine-learning-text-feature-0
Classification Yelp Reviews (AWS) http://www.developintelligence.com/blog/2017/06/practical-neural-networks-keras-classifying-yelp-reviews/
Convolutional Neural Networks for Text Classification (waouuuuu) http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ https://github.com/davidsbatista/ConvNets-for-sentence-classification
3 ways to interpretate your NLP model [Lime, ELI5, Skater] https://github.com/makcedward/nlp/blob/master/sample/nlp-model_interpretation.ipynb https://towardsdatascience.com/3-ways-to-interpretate-your-nlp-model-to-management-and-customer-5428bc07ce15 https://medium.freecodecamp.org/how-to-improve-your-machine-learning-models-by-explaining-predictions-with-lime-7493e1d78375
Deep Learning for text made easy with AllenNLP https://medium.com/swlh/deep-learning-for-text-made-easy-with-allennlp-62bc79d41f31
Ensemble Classifiers https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/
**Classification Algorithms ** [tfidf, count features, logistic regression, naive bayes, svm, xgboost, grid search, word vectors, LSTM, GRU, Ensembling] : https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle/notebook
Deep learning architecture [TextCNN, BiDirectional RNN(LSTM/GRU), Attention Models] : https://mlwhiz.com/blog/2019/03/09/deeplearning_architectures_text_classification/ and https://www.kaggle.com/mlwhiz/attention-pytorch-and-keras
CNN + Word2vec and LSTM + Word2Vec : https://www.kaggle.com/kakiac/deep-learning-4-text-classification-cnn-bi-lstm
Comparison of models [Bag of Words - Countvectorizer Features, TFIDF Features, Hashing Features, Word2vec Features] : https://mlwhiz.com/blog/2019/02/08/deeplearning_nlp_conventional_methods/
Embed, encode, attend, predict : https://explosion.ai/blog/deep-learning-formula-nlp
Visualisation sympa pour comprendre CNN : http://www.thushv.com/natural_language_processing/make-cnns-for-nlp-great-again-classifying-sentences-with-cnns-in-tensorflow/
Yelp comments classification [ LSTM, LSTM + CNN] : https://github.com/msahamed/yelp_comments_classification_nlp/blob/master/word_embeddings.ipynb
RNN text classification : https://karpathy.github.io/2015/05/21/rnn-effectiveness/
CNN for Sentence Classification & DCNN for Modelling Sentences & VDNN for Text Classification & Multi Channel Variable size CNN & Multi Group Norm Constraint CNN & RACNN Neural Networks for Text Classification: https://bicepjai.github.io/machine-learning/2017/11/10/text-class-part1.html
Transformers : https://towardsdatascience.com/transformers-141e32e69591
Seq2Seq : https://guillaumegenthial.github.io /sequence-to-sequence.html
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) : https://jalammar.github.io/
LSTM & GRU explanation : https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
Text classification using attention mechanism in Keras : http://androidkt.com/text-classification-using-attention-mechanism-in-keras/
Bernoulli Naive Bayes & Multinomial Naive Bayes & Random Forests & Linear SVM & SVM with non-linear kernel https://github.com/irfanelahi-ds/document-classification-python/blob/master/document_classification_python_sklearn_nltk.ipynb and https://richliao.github.io/
DL text classification : https://gitlab.com/the_insighters/data-university/nuggets/document-classification-with-deep-learning
1-D Convolutions over text : http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/ and https://github.com/davidsbatista/ConvNets-for-sentence-classification/blob/master/Convolutional-Neural-Networks-for-Sentence-Classification.ipynb
[Bonus] Sentiment Analysis in PySpark : https://github.com/tthustla/setiment_analysis_pyspark/blob/master/Sentiment%20Analysis%20with%20PySpark.ipynb
RNN Text Generation : https://github.com/priya-dwivedi/Deep-Learning/blob/master/RNN_text_generation/RNN_project.ipynb
Finding similar documents with Word2Vec and Soft Cosine Measure: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb
[!! ESSENTIAL !!] Text Classification with Hierarchical Attention Networks: https://humboldt-wi.github.io/blog/research/information_systems_1819/group5_han/
[ESSENTIAL for any NLP Project]: https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
Doc2Vec + Logistic Regression : https://github.com/susanli2016/NLP-with-Python/blob/master/Doc2Vec%20Consumer%20Complaint_3.ipynb
Doc2Vec -> just embedding: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb
New way of embedding -> Poincaré Embeddings: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Poincare%20Tutorial.ipynb
Doc2Vec + Text similarity: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Graph Link predictions + Part-of-Speech tagging tutorial with the Keras: https://github.com/Cdiscount/IT-Blog/tree/master/scripts/link-prediction & https://techblog.cdiscount.com/link-prediction-in-large-scale-networks/

Other Topics - Text Similarity [Word Mover Distance] =========================================================

Finding similar documents with Word2Vec and WMD : https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
Introduction to Wasserstein metric (earth mover’s distance): https://yoo2080.wordpress.com/2015/04/09/introduction-to-wasserstein-metric-earth-movers-distance/
Earthmover Distance: https://jeremykun.com/2018/03/05/earthmover-distance/ Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters). For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?
Word Mover’s distance calculation between word pairs of two documents: https://stats.stackexchange.com/questions/303050/word-movers-distance-calculation-between-word-pairs-of-two-documents
Word Mover’s Distance (WMD) for Python: https://github.com/stephenhky/PyWMD/blob/master/WordMoverDistanceDemo.ipynb
[LECTURES] : Computational Optimal Transport : https://optimaltransport.github.io/pdf/ComputationalOT.pdf
Computing the Earth Mover’s Distance under Transformations : http://robotics.stanford.edu/~scohen/research/emdg/emdg.html
[LECTURES] Slides WMD: http://robotics.stanford.edu/~rubner/slides/sld014.htm

Others [Quora Datset] :

BOW + Xgboost Model + Word level TF-IDF + XgBoost + N-gram Level TF-IDF + Xgboost + Character Level TF-IDF + XGboost: https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Xgboost_bow_tfidf.ipynb

8 - Other Topics - Topic Modeling LDA

https://github.com/FelixChop/MediumArticles/blob/master/LDA-BBC.ipynb

https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb

TF-IDF + K-means & Latent Dirichlet Allocation (with Bokeh): https://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
[!! ESSENTIAL !!] Building a LDA-based Book Recommender System: https://humboldt-wi.github.io/blog/research/information_systems_1819/is_lda_final/

9 - Variational Autoencoder

Text generation with a Variational Autoencoder : https://github.com/NicGian/text_VAE
Variational_text_inference : https://github.com/s4sarath/Deep-Learning-Projects/tree/master/variational_text_inference and https://s4sarath.github.io/2016/11/23/variational_autoenocder_for_Natural_Language_Processing

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
pictures		pictures
Readme.md		Readme.md
[Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb		[Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb
[Introduction] - Big tutorial - Text Classification.ipynb		[Introduction] - Big tutorial - Text Classification.ipynb
[Supervised] [DL method] GRU_HAN.ipynb		[Supervised] [DL method] GRU_HAN.ipynb
[Unsupervised] LDA.ipynb		[Unsupervised] LDA.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pictures

pictures

Readme.md

Readme.md

[Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb

[Basic] [Document Similarity] [Unsupervised] - TFIDF - BoW - Bag of N-Grams - Kmeans - LDA.ipynb

[Introduction] - Big tutorial - Text Classification.ipynb

[Introduction] - Big tutorial - Text Classification.ipynb

[Supervised] [DL method] GRU_HAN.ipynb

[Supervised] [DL method] GRU_HAN.ipynb

[Unsupervised] LDA.ipynb

[Unsupervised] LDA.ipynb

Repository files navigation

Multi-classes task classification and LDA-based topic Recommender System

1 - Text Mining

2 - Word Embedding

A - Frequency Based Embedding

B - Prediction Based Embedding

LDA

C - Poincaré Embedding [Embeddings and Hyperbolic Geometry]

3 - Algorithms

A - Traditional Methods

B - Deep Learning Methods

4 - Explainability

5 - MyApp of multi-classes text classification with Attention mechanism

6 - Ressources / Bibliography

Others [Quora Datset] :

8 - Other Topics - Topic Modeling LDA

9 - Variational Autoencoder

About

Releases

Packages

Languages

adsieg/Multi_Text_Classification

Folders and files

Latest commit

History

Repository files navigation

Multi-classes task classification and LDA-based topic Recommender System

1 - Text Mining

2 - Word Embedding

A - Frequency Based Embedding

B - Prediction Based Embedding

LDA

C - Poincaré Embedding [Embeddings and Hyperbolic Geometry]

3 - Algorithms

A - Traditional Methods

B - Deep Learning Methods

4 - Explainability

5 - MyApp of multi-classes text classification with Attention mechanism

6 - Ressources / Bibliography

Others [Quora Datset] :

8 - Other Topics - Topic Modeling LDA

9 - Variational Autoencoder

About

Resources

Stars

Watchers

Forks

Languages