kkm24132/DataScience_NLP

NLP (Natural Language Processing)

Objective: This repository collects pointers and guidance on the fundamentals of NLP (Natural Language Processing), both from a learning perspective and for innovation and research. It also highlights recommended subject areas and content for accelerating learning in this field.

Target Audience: Data Science and AI practitioners who already have a fundamental working knowledge of Machine Learning concepts and a Python/R/SQL programming background.

Contents:

Research Focus and Trends

Back to Contents

Research Focus Sub-Segments

  • Lexical Semantics, Semantic Processing
  • POS Tagging
  • Discourse
  • Paraphrasing / Entailment / Generation
  • Machine Translation
  • Information Retrieval
  • Text Mining
  • Information Extraction
  • Question Answering
  • Dialog Systems
  • Spoken Language Processing
  • Speech Recognition & Synthesis
  • Computational Linguistics and NLP
  • Chunking / Shallow Parsing
  • Parsing / Grammatical Formalisms etc.

Introduction and Learning Content

Area | Description | Target Timeline
--- | --- | ---
Pre-Requisites | | Week 0
Handling Text Processing | | Week 1-4
Language Modeling & Sentiment Classification with DL, Translation with RNNs | | Week 5-8
Reading and handling Text from Images | | Week 9-12

fast.ai NLP Course Cool links

Back to Contents

Techniques

Back to Contents

Libraries / Packages

  • R NLP Libraries
    • text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R
    • wordVectors - An R package for creating and exploring word2vec and other word embedding models
    • RMallet - R package to interface with the Java machine learning tool MALLET
    • dfr-browser - Creates d3 visualizations for browsing topic models of text in a web browser.
    • dfrtopics - R package for exploring topic models of text.
    • sentiment_classifier - Sentiment Classification using Word Sense Disambiguation and WordNet Reader
  • Python NLP Libraries
    • NLTK - Natural Language ToolKit
    • TextBlob - Simplified text processing, providing a consistent API for diving into common natural language processing (NLP) tasks. Stands on the giant shoulders of Natural Language Toolkit (NLTK) and Pattern, and plays nicely with both.
    • spaCy - Industrial strength NLP with Python and Cython
    • gensim - Python library to conduct unsupervised semantic modelling from plain text
    • scattertext - Python library to produce d3 visualizations of how language differs between corpora
    • GluonNLP - A deep learning toolkit for NLP, built on MXNet/Gluon, for research prototyping and industrial deployment of state-of-the-art models on a wide range of NLP tasks.
    • AllenNLP - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.
    • PyTorch-NLP - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, common NLP metrics such as BLEU
    • Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)

Back to Contents

Services

  • Amazon Comprehend - NLP and ML suite covering most common tasks such as NER (Named Entity Recognition), tagging, and sentiment analysis
  • Google Cloud Natural Language API - Syntax Analysis, NER, Sentiment Analysis, and Content tagging in at least 9 languages, including English
  • Microsoft Cognitive Service: Text Analytics
  • IBM Watson's Natural Language Understanding - API and Github demo
  • Cloudmersive - Unified and free NLP APIs that perform actions such as part-of-speech tagging, text rephrasing, language translation/detection, and sentence parsing
  • ParallelDots - High level Text Analysis API Service ranging from Sentiment Analysis to Intent Analysis
  • Wit.ai - Natural Language Interface for apps and devices
  • Rosette - An adaptable platform for text analytics and discovery
  • TextRazor - Extract meaning from your text
  • Textalytic - Natural Language Processing in the Browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more

Back to Contents

Fundamentals and Basics

  • Stopwords: Stop words are words that occur very frequently in a text corpus, e.g., a, an, the, in. Because they carry little meaning on their own, they are commonly removed from the corpus as a text normalization step. With NLTK, the statement from nltk.corpus import stopwords gives access to ready-made stop word lists.
  • Stemming: Stemming strips inflections from words. Words with the same origin are reduced to a common form, which may or may not be a valid word itself. NLTK provides several stemmers that implement different algorithms.
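
The two steps above can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the STOP_WORDS set is a small hand-picked stand-in for the list NLTK provides through nltk.corpus.stopwords, and toy_stem is a hypothetical suffix-stripping rule set that is far cruder than a real stemmer such as NLTK's PorterStemmer.

```python
import re

# Hypothetical, hand-picked stop word list for illustration only.
# NLTK ships a fuller one via nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "the", "in", "is", "are", "on", "of", "and", "to"}

# Suffixes tried in order; a toy rule set, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("The cats are running in the garden")
content_tokens = remove_stop_words(tokens)
stems = [toy_stem(t) for t in content_tokens]

print(content_tokens)  # ['cats', 'running', 'garden']
print(stems)           # ['cat', 'runn', 'garden']
```

Note that running is reduced to runn, which is not a dictionary word. This is the behaviour described above: a stem only needs to be consistent across related word forms, not a valid word.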

Datasets

Back to Contents

Video and Online Content References

Back to Contents

CourseReferences

Back to Contents

End of Contents

Disclaimer: The information presented here is based on my own experiences, learnings, and readings. It in no way represents the opinion or strategy of any firm or individual, and it is not intended for anything other than learning, research, and innovation in the field. The content in this repository is non-exhaustive, and continuous learning and improvement are needed to go further. Recommendation: keep learning and keep improving.
