

Fourthbrain NLP Capstone Project (GLG Topic Modeling and Named Entity Recognition)

NLP Capstone Project YouTube Demo Link

Table of Contents
  1. Project Description
  2. Project Objective
  3. Built With
  4. Data Sources
  5. Topic Modeling Pipeline
  6. Named Entity Recognition
  7. Getting Started
  8. Usage
  9. Support
  10. License
  11. Contact
  12. References

Project Description

Gerson Lehrman Group (GLG) is a financial and information services firm. It is an insight network that connects decision makers to a network of experts so they can act with the confidence that comes from true clarity and have what it takes to get ahead. GLG receives a large number of requests (including requests related to health and tech) from clients seeking insights on different topics. Manually preprocessing these client requests and extracting relevant topics/keywords is time-consuming and requires significant manpower. This project uses Natural Language Processing (NLP) to improve topic/keyword detection from client-submitted reports and to identify the underlying patterns in submitted requests over time. The primary challenges include Named Entity Recognition (NER) and pattern recognition for hierarchical clustering of topics.

(back to top)

Project Objective

The purpose of this project is to develop an NLP model capable of recognizing and clustering topics related to technological and healthcare terms given a large text corpus and to develop an NER model capable of extracting entities from a given sentence.

(back to top)

Built With

  • Python
    • NumPy/pandas
    • Scikit-learn
    • Matplotlib
    • Keras
    • PyTorch
    • Seaborn
    • Streamlit
  • Language Models
    • SBERT
    • NLTK
  • Jupyter Notebook
  • Visual Studio Code

(back to top)

Data Sources

  • All the News 2.0 — This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020.

  • Annotated Corpus for NER — An annotated corpus for Named Entity Recognition based on the GMB (Groningen Meaning Bank) corpus, prepared for entity classification with enhanced and popular NLP features. The entities in this dataset are:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time Indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon

(back to top)

Topic Modeling Pipeline

[Topic modeling pipeline diagram]

Topic models are useful tools for discovering latent topics in collections of documents. In the sections below, we look into the details of each part of the topic modeling pipeline, with highlights and key findings.

Data Cleaning and Data Exploration: The first step in the pipeline is cleaning and exploring the news article dataset. From the original data we extract only the news articles that belong to the health and technology sections. We then perform several text cleaning steps (a minimal sketch follows the list below):

  • Punctuation and non-alphanumeric character removal.
  • Tokenization: split the text into sentences and the sentences into words; lowercase the words.
  • Removal of words that have fewer than 3 characters.
  • Stopword removal.
  • Lemmatization.
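
As a rough illustration, the sketch below applies these cleaning steps with NLTK; the exact regex, token-length threshold, and library choices are assumptions, not the project's exact code.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove punctuation / non-alphanumeric characters and lowercase.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text).lower()
    tokens = nltk.word_tokenize(text)
    # Drop short tokens and stopwords, then lemmatize what remains.
    return [lemmatizer.lemmatize(tok)
            for tok in tokens
            if len(tok) >= 3 and tok not in STOPWORDS]
```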

Document Embedding: We embed documents to create representations in vector space that can be compared semantically. We assume that documents containing the same topic are semantically similar. To perform the embedding step, we first extract the sentences in each document using the NLTK sentence tokenizer, then apply the [Sentence-BERT (SBERT) framework](https://arxiv.org/abs/1908.10084) to each sentence to generate a vector representation per sentence, and finally combine the sentence vectors into a single embedding vector for the document. These embeddings, however, are primarily used to cluster semantically similar documents and are not directly used in generating the topics.
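
A minimal sketch of this step, assuming the sentence-transformers package; the checkpoint name and the use of mean pooling to combine sentence vectors are illustrative assumptions:

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

# "all-mpnet-base-v2" produces 768-dimensional vectors, matching the
# dimensionality described below; the exact checkpoint is an assumption.
model = SentenceTransformer("all-mpnet-base-v2")

def embed_document(document):
    sentences = sent_tokenize(document)      # NLTK sentence tokenizer
    sentence_vecs = model.encode(sentences)  # shape: (n_sentences, 768)
    # Combine sentence vectors into one document vector; mean pooling
    # is shown here as one common aggregation choice.
    return sentence_vecs.mean(axis=0)
```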

Feature Reduction: In the document embedding step above, we embed each document using SBERT, which generates a 768-dimensional dense vector. Working with such high-dimensional vectors is computationally heavy and complex; hence, we apply a dimensionality reduction technique called UMAP ([Uniform Manifold Approximation and Projection](http://arxiv.org/abs/1802.03426)) to reduce the number of features without losing important information.
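
A sketch of the reduction step with the umap-learn package; the parameter values below are illustrative assumptions:

```python
import umap  # umap-learn package

# doc_embeddings: (n_docs, 768) array of SBERT document vectors
# from the embedding step above.
reducer = umap.UMAP(n_neighbors=15, n_components=5,
                    metric="cosine", random_state=42)
reduced_embeddings = reducer.fit_transform(doc_embeddings)  # (n_docs, 5)
```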

Document Clustering: Finally, we apply the [HDBSCAN](https://www.theoj.org/joss-papers/joss.00205/10.21105.joss.00205.pdf) (hierarchical density-based clustering) algorithm to extract clusters of semantically similar documents. It is an extension of DBSCAN that finds clusters of varying densities by converting DBSCAN into a hierarchical clustering algorithm. HDBSCAN models clusters using a soft-clustering approach, allowing noise to be modeled as outliers. This prevents unrelated documents from being assigned to any cluster and is expected to improve topic representations.
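
A sketch of the clustering step with the hdbscan package; min_cluster_size and the distance metric are illustrative assumptions:

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(reduced_embeddings)
# Documents labeled -1 are noise/outliers and are not forced into any cluster.
# clusterer.condensed_tree_ exposes the cluster hierarchy used downstream.
```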

Topic Representation: The topic representations are modeled based on the documents in each cluster, where each cluster is assigned more than one global and local topic. Using the HDBSCAN algorithm, we can access the hierarchical structure of the documents in each cluster; that is, within each cluster the documents are organized in a parent/child hierarchy. Therefore, for each cluster we can extract global and local topics by applying the [LDA](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) (Latent Dirichlet Allocation) model to those documents. Thus, we have two LDA models per cluster, responsible for generating global and local topics from the parent and child documents respectively.
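
A sketch of fitting one LDA model on the documents of a cluster, using scikit-learn; parent_docs, child_docs, and all parameter values are placeholders for illustration, not the project's exact configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def extract_topics(cluster_docs, n_topics=5, n_words=10):
    # Bag-of-words representation of the documents in one cluster.
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
    dtm = vectorizer.fit_transform(cluster_docs)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=42).fit(dtm)
    vocab = vectorizer.get_feature_names_out()
    # Return the top words for each topic.
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]]
            for topic in lda.components_]

# Two models per cluster: parent documents yield global topics,
# child documents yield local topics.
global_topics = extract_topics(parent_docs)
local_topics = extract_topics(child_docs)
```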

(back to top)

Named Entity Recognition

  • NER is a widely used NLP technique that recognizes entities contained in a piece of text, commonly things like people, organizations, locations, etc. This project also includes an NER model implemented with BERT, using the Hugging Face Transformers PyTorch library to quickly and efficiently fine-tune BERT for state-of-the-art performance on Named Entity Recognition. The transformers package provides a BertForTokenClassification class for token-level predictions. BertForTokenClassification is a fine-tuning model that wraps BertModel and adds a token-level classifier on top of it. The token-level classifier is a linear layer that takes the last hidden state of the sequence as input (a minimal sketch follows the example below).

  • Below is an example of the input and output of our named entity model, served with a Streamlit app.

[Example input and output of the NER model in the Streamlit app]
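
A minimal sketch of token-level prediction with BertForTokenClassification; the checkpoint name and label list are illustrative assumptions, and in practice a fine-tuned checkpoint would be loaded rather than the base model:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Label set mirroring the GMB entity tags in IOB format (an assumption).
LABELS = ["O",
          "B-geo", "I-geo", "B-org", "I-org", "B-per", "I-per",
          "B-gpe", "I-gpe", "B-tim", "I-tim", "B-art", "I-art",
          "B-eve", "I-eve", "B-nat", "I-nat"]

# In practice this would point at the fine-tuned checkpoint, not the base model.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(LABELS))
model.eval()

sentence = "GLG connects experts across London and New York."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(f"{token}\t{LABELS[pred]}")
```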

(back to top)

Getting Started

Prerequisites

  1. Install Git LFS (for an installation guide, see the tutorial)
  2. Install Docker (for an installation guide, see the tutorial)
  3. Install Docker Compose (for an installation guide, see the tutorial)

Installation with Docker Compose

To package the whole solution, which uses multiple images/containers, I used Docker Compose. Please follow the steps below for a successful installation.

  1. Clone the repo
    git lfs clone https://github.com/kedir/GLG--Topic-Modeling-and-Document-Clustering.git
  2. Go to the project directory
    cd GLG--Topic-Modeling-and-Document-Clustering
  3. Create a bridge network. Since we have multiple containers communicating with each other, I created a bridge network called AIservice. Create the network by running this command:
    docker network create AIservice
  4. Run the whole application by executing this command:
    docker-compose up -d --build

(back to top)

Usage

Frontend app with Streamlit

You can view the frontend app in the browser at http://localhost:8501/. If you are launching the app in the cloud, replace localhost with your public IP address.

[Streamlit frontend app]

For more examples, please refer to the Documentation.

(back to top)

Support

Contributions, issues, and feature requests are welcome!

Give a ⭐️ if you like this project!

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

(back to top)

References

(back to top)
