NLP (Natural Language Processing)

Objective: This repository captures pointers and guidance on the fundamentals of NLP (Natural Language Processing) from a learning perspective, along with innovation and research areas. It also highlights recommended subject areas and content to accelerate learning in this field.

Target Audience: Data Science and AI practitioners who already have a fundamental working knowledge of Machine Learning concepts and a Python/R/SQL programming background.

Research Focus and Trends

Updates in 2022-2023:
- Generative AI from Text-to-Image generation standpoint:
  - Hierarchical Text-Conditional Image Generation with CLIP Latents - DALL-E 2
  - High-Resolution Image Synthesis with Latent Diffusion Models - Stable Diffusion
  - LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models - uses CLIP
Please keep referring to NLP-related research papers from AAAI, NeurIPS, ACL, ICLR and similar conferences for the latest research focus areas. Most of these are also available on arXiv.org.
A few recent, key research papers to read are listed below (note that this list keeps changing and may not be up to date):
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale - the GitHub page
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer - the GitHub page with pretrained models along with the dataset and code
- Reformer: The Efficient Transformer - the GitHub page with official code implementation from Google and the GitHub page with PyTorch implementation of Reformer
- Longformer: The Long-Document Transformer - the GitHub page
NLP-Progress tracks the progress in Natural Language Processing, including the datasets and the current state-of-the-art for the most common NLP tasks.
NLP-Overview is an up-to-date overview of deep learning techniques applied to NLP, including theory, implementations, applications, and state-of-the-art results. This is a great Deep NLP Introduction for researchers.
Detect Radiology related entities with Spark NLP
NLP's ImageNet moment
ACL 2018 Highlights: Understanding Representations and Evaluation in More Challenging Settings
Four deep learning trends from ACL 2017 - Part 1 - Linguistic Structure and Word Embeddings
Four deep learning trends from ACL 2017 - Part 2 - Interpretability and Attention
Deep Learning for NLP: Advancements & Trends
Deep Learning for NLP : without Magic
Stanford NLP
BERT, ELMo and GPT-2: How Contextual are Contextualized Word Representations? - from Stanford AI Lab
The Illustrated BERT, ELMo and others - NLP and transfer learning context
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, Related Code
A Mutual Information Maximization Perspective of Language Representation Learning
DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

Back to Contents

Research Focus Sub-Segments

Lexical Semantics, Semantic Processing
POS Tagging
Discourse
Paraphrasing / Entailment / Generation
Machine Translation
Information Retrieval
Text Mining
Information Extraction
Question Answering
Dialog Systems
Spoken Language Processing
Speech Recognition & Synthesis
Computational Linguistics and NLP
Chunking / Shallow Parsing
Parsing / Grammatical Formalisms etc.

Introduction and Learning Content

Area	Description	Target Timeline
Pre-Requisites	Familiarity with Python programming (Some Ref); descriptive statistics by Khan Academy; The Elements of Statistical Learning / ISLR book reference by Hastie, Tibshirani et al.; Machine Learning fundamentals (Andrew Ng's course on ML); familiarity with Data Science processes and frameworks (CRISP-DM)	Week 0
Handling Text Processing	Text pre-processing techniques (familiarity with the spaCy and NLTK libraries, tokenization using spaCy, stopword removal and text normalization); regular expressions; exploratory analysis with text data; extracting meta features from text; building a text classification model; practice problem: Identify Sentiments (or any equivalent problem, for hands-on experience)	Week 1-4
Language Modeling & Sentiment Classification with DL, Translation with RNNs	Language model; transfer learning; sentiment classification; predicting the English word version of numbers using an RNN; transfer learning for natural language modeling using IMDb	Week 5-8
Reading and handling Text from Images	OpenCV (Ref); PyTesseract; Tesseract software (Wiki Ref); here is an example of reading text from images	Week 9-12
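
The Week 1-4 items (regular expressions, tokenization, text normalization and meta features) can be sketched with the Python standard library alone. This is a minimal illustration, not the approach used by spaCy or NLTK: the token pattern, the function names `tokenize` and `meta_features`, the chosen features and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters/digits, optionally followed by an
# apostrophe suffix (so "isn't" stays one token). Real tokenizers are richer.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Normalize to lowercase and split into tokens with a regular expression."""
    return TOKEN_RE.findall(text.lower())


def meta_features(text):
    """Compute a few simple meta features from the raw (un-normalized) text."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "hashtag_count": sum(1 for w in words if w.startswith("#")),
        "uppercase_word_count": sum(1 for w in words if w.isupper()),
    }


# Hypothetical sample text, invented for illustration.
sample = "Loving the new phone, battery life is GREAT! #happy #phone"

tokens = tokenize(sample)
print(tokens)
print(Counter(tokens).most_common(3))  # quick exploratory frequency check
print(meta_features(sample))
```

Features such as these are often fed, alongside bag-of-words counts, into a first text classification model for a sentiment-style practice problem.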

fast.ai NLP Course Cool links

Back to Contents

Techniques

Text Embeddings
- Word Embeddings
  - Rule of thumb: fastText >> GloVe > word2vec
  - Implementation from Facebook Research - fastText
  - GloVe: Global Vectors for Word Representation - Explainer Blog
  - word2vec - Implementation - Explainer Blog
- Sentence and Language Model Based Word Embeddings
  - ELMo: Embeddings from Language Models - Basics, Deep contextualized word representations
    - PyTorch Implementation from AllenAI/AllenNLP
    - TF Implementation from AllenAI
  - ULMFiT : Universal Language Model Fine-tuning for Text Classification by Jeremy Howard and Sebastian Ruder - Paper Ref
  - InferSent - Supervised Learning of Universal Sentence Representations from Natural Language Inference Data by Facebook - Paper Ref
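
Whichever embedding model is used, word vectors are typically compared with cosine similarity. The sketch below shows that comparison in plain Python. The three-dimensional vectors are made-up toy values (real fastText, GloVe or word2vec vectors usually have hundreds of dimensions and are loaded from pretrained files), and the names `cosine_similarity` and `most_similar` are chosen only for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, invented for illustration only.
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("king", toy_vectors):
    print(f"{other}: {score:.3f}")
```

With these toy values, "queen" ranks above "apple" for the query "king", which is the kind of neighbourhood structure trained embeddings are expected to show.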
Question Answering and Knowledge Extraction
- DrQA - Open Domain Question Answering work by Facebook Research on Wikipedia data
- Document-QA - Simple and Effective Multi-Paragraph Reading Comprehension by AllenAI
- Privee - An Architecture for Automatically Analyzing Web Privacy Policies
- Template-Based Information Extraction without the Templates

Back to Contents

Libraries / Packages

R NLP Libraries
- text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R
- wordVectors - An R package for creating and exploring word2vec and other word embedding models
- RMallet - R package to interface with the Java machine learning tool MALLET
- dfr-browser - Creates d3 visualizations for browsing topic models of text in a web browser.
- dfrtopics - R package for exploring topic models of text.
- sentiment_classifier - Sentiment Classification using Word Sense Disambiguation and WordNet Reader
Python NLP Libraries
- NLTK - Natural Language ToolKit
- TextBlob - Simplified text processing, providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of Natural Language Toolkit (NLTK) and Pattern, and plays nicely with both
- spaCy - Industrial strength NLP with Python and Cython
- gensim - Python library to conduct unsupervised semantic modelling from plain text
- scattertext - Python library to produce d3 visualizations of how language differs between corpora
- GluonNLP - A deep learning toolkit for NLP, built on MXNet/Gluon, for research prototyping and industrial deployment of state-of-the-art models on a wide range of NLP tasks.
- AllenNLP - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.
- PyTorch-NLP - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, common NLP metrics such as BLEU
- Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)

Back to Contents

Services

Amazon Comprehend - NLP and ML suite covering the most common tasks, such as NER (Named Entity Recognition), tagging, and sentiment analysis
Google Cloud Natural Language API - Syntax analysis, NER, sentiment analysis, and content tagging in at least 9 languages, including English
Microsoft Cognitive Service: Text Analytics
IBM Watson's Natural Language Understanding - API and GitHub demo
Cloudmersive - Unified and free NLP APIs that perform actions such as part-of-speech tagging, text rephrasing, language translation/detection, and sentence parsing
ParallelDots - High level Text Analysis API Service ranging from Sentiment Analysis to Intent Analysis
Wit.ai - Natural Language Interface for apps and devices
Rosette - An adaptable platform for text analytics and discovery
TextRazor - Extract meaning from your text
Textalytic - Natural Language Processing in the Browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more

Back to Contents

Fundamentals and Basics

Stopwords: Stop words are words that occur very frequently in a text corpus, e.g. a, an, the, in. Because they carry little distinguishing information, they are usually removed from the corpus as part of text normalization. NLTK provides ready-made lists via from nltk.corpus import stopwords.
Stemming: Stemming reduces inflected words to a common base form. Words with the same origin are reduced to the same stem, which may or may not be a valid word. NLTK ships several stemmers that implement different algorithms.
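
The two ideas above can be illustrated without any third-party library. The sketch below is deliberately simplified: the stopword set contains only the four example words from the definition, and `toy_stem` is a naive suffix stripper written for this example, not one of NLTK's algorithms. In practice, `stopwords.words("english")` from `nltk.corpus` and a stemmer such as `PorterStemmer` from `nltk.stem` are the usual starting points.

```python
# Only the example stop words named above; real lists are far longer.
STOPWORDS = {"a", "an", "the", "in"}

# Suffixes tried in this order by the toy stemmer (an illustrative choice).
SUFFIXES = ("ing", "ed", "es", "s")


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters.

    This is a naive sketch: the resulting stem is not guaranteed to be a
    real word, which mirrors the caveat in the definition of stemming.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "the cats jumped in a garden".split()
content_tokens = remove_stopwords(tokens)
print(content_tokens)
print([toy_stem(t) for t in content_tokens])
print(toy_stem("running"))  # leaves a stem that is not a valid word
```

The last call shows why stems "may or may not be a word": stripping "ing" from "running" leaves a doubled consonant that a rule-based stemmer with more rules would clean up.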

Datasets

NLP-datasets - Great collection of NLP datasets for use
gensim-datasets - Data repository for pretrained NLP models and NLP corpora

Back to Contents

Video and Online Content References

Stanford Deep Learning for Natural Language Processing (CS224n) - Richard Socher and Christopher Manning's Stanford Course
Deep Natural Language Processing - Lecture series from Oxford
Neural Networks for NLP - Carnegie Mellon Language Technologies Institute
fast.ai Code-First Intro to Natural Language Processing - This covers a blend of traditional NLP topics (including regex, SVD, naive bayes, tokenization) and recent neural network approaches (including RNNs, seq2seq, GRUs, and the Transformer), as well as addressing urgent ethical issues, such as bias and disinformation. Find the Jupyter Notebooks here
Deep NLP Course by Yandex Data School, covering important ideas from text embedding to machine translation including sequence modeling, language models and so on
Machine Learning University - Accelerated Natural Language Processing - Lectures go from introduction to NLP and text processing to Recurrent Neural Networks and Transformers. Material can be found here
Knowledge Graphs in Natural Language Processing @ ACL 2020
Around 330+ practical notebooks leveraging NLP techniques
NLP at Scale - MLOps aspects for customer success
MLM with BERT

Back to Contents

CourseReferences

Back to Contents

End of Contents

Disclaimer: The information presented here is based on my own experiences, learnings and readings. It in no way represents any firm's or individual's opinion or strategy, and it is intended only for learning, research and innovation in the field. The content in this repository is non-exhaustive, and a continuous learning focus is needed to go further. Recommendation: keep learning and keep improving.

Contents:

Research Focus and Trends

Research Focus Sub-Segments

Introduction and Learning Content

Techniques

Libraries / Packages

Services

Fundamentals and Basics

Datasets

Video and Online Content References

CourseReferences
