

Fourthbrain NLP Capstone Project (GLG Topic Modeling and Named Entity Recognition)

NLP Capstone Project YouTube Demo Link

Table of Contents
  1. Project Description
  2. Project Objective
  3. Built With
  4. Data Sources
  5. Topic Modeling Pipeline
  6. Named Entity Recognition
  7. Getting Started
  8. Usage
  9. Support
  10. License
  11. Contact
  12. References

Project Description

Gerson Lehrman Group (GLG) is a financial and information services firm. It is an insight network that connects decision makers to a network of experts so they can act with the confidence that comes from true clarity and have what it takes to get ahead. GLG receives a large number of requests (including requests related to health and tech) from clients seeking insights on different topics. Manually preprocessing these client requests and extracting relevant topics/keywords is time-consuming and requires significant manpower. This project uses Natural Language Processing (NLP) to improve topic/keyword detection from client-submitted reports and to identify the underlying patterns in submitted requests over time. The primary challenges include Named Entity Recognition (NER) and pattern recognition for hierarchical clustering of topics.

(back to top)

Project Objective

The purpose of this project is to develop an NLP model capable of recognizing and clustering topics related to technological and healthcare terms given a large text corpus and to develop an NER model capable of extracting entities from a given sentence.

(back to top)

Built With

  • Python
    • NumPy/pandas
    • Scikit-learn
    • Matplotlib
    • Keras
    • PyTorch
    • Seaborn
    • Streamlit
  • Language Models
    • SBERT
    • NLTK
  • Jupyter Notebook
  • Visual Studio Code

(back to top)

Data Sources

  • All the News 2.0 — This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020.

  • Annotated Corpus for NER — An annotated corpus for Named Entity Recognition based on the GMB (Groningen Meaning Bank) corpus, prepared for entity classification with enhanced and popular NLP features. The entities in this dataset are:

  • geo = Geographical Entity
  • org = Organization
  • per = Person
  • gpe = Geopolitical Entity
  • tim = Time Indicator
  • art = Artifact
  • eve = Event
  • nat = Natural Phenomenon

(back to top)

Topic Modeling Pipeline

[Topic modeling pipeline diagram]

Topic models are useful tools for discovering latent topics in collections of documents. In the sections below, we look into the details of each part of the topic modeling pipeline, with highlights and key findings.

Data Cleaning and Data Exploration: The first step in the pipeline is cleaning and exploring the news article dataset. From the original data we extract only the news articles that belong to the health and technology sections. We then perform several text cleaning steps (a minimal sketch follows the list below):

  • Punctuation and non-alphanumeric character removal.
  • Tokenization: split the text into sentences and the sentences into words; lowercase the words.
  • Removal of words that have fewer than 3 characters.
  • Stopword removal.
  • Lemmatization.
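
As a rough illustration, the sketch below applies these cleaning steps with NLTK; the exact regex, token-length threshold, and library choices are assumptions, not the project's exact code.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove punctuation / non-alphanumeric characters and lowercase.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text).lower()
    tokens = nltk.word_tokenize(text)
    # Drop short tokens and stopwords, then lemmatize what remains.
    return [lemmatizer.lemmatize(tok)
            for tok in tokens
            if len(tok) >= 3 and tok not in STOPWORDS]
```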

Document Embedding: We embed documents to create representations in vector space that can be compared semantically. We assume that documents containing the same topic are semantically similar. To perform the embedding step, we first extract the sentences in each document using the NLTK sentence tokenizer, then apply the [Sentence-BERT (SBERT) framework](https://arxiv.org/abs/1908.10084) to each sentence to generate a vector representation per sentence, and finally combine the sentence vectors into a single embedding vector for the document. These embeddings, however, are primarily used to cluster semantically similar documents and are not directly used in generating the topics.
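
A minimal sketch of this step, assuming the sentence-transformers package; the checkpoint name and the use of mean pooling to combine sentence vectors are illustrative assumptions:

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

# "all-mpnet-base-v2" produces 768-dimensional vectors, matching the
# dimensionality described below; the exact checkpoint is an assumption.
model = SentenceTransformer("all-mpnet-base-v2")

def embed_document(document):
    sentences = sent_tokenize(document)      # NLTK sentence tokenizer
    sentence_vecs = model.encode(sentences)  # shape: (n_sentences, 768)
    # Combine sentence vectors into one document vector; mean pooling
    # is shown here as one common aggregation choice.
    return sentence_vecs.mean(axis=0)
```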

Feature Reduction: In the document embedding step above, we embed each document using SBERT, which generates a 768-dimensional dense vector. Working with such high-dimensional vectors is computationally heavy and complex; hence, we apply a dimensionality reduction technique called UMAP ([Uniform Manifold Approximation and Projection](http://arxiv.org/abs/1802.03426)) to reduce the number of features without losing important information.
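
A sketch of the reduction step with the umap-learn package; the parameter values below are illustrative assumptions:

```python
import umap  # umap-learn package

# doc_embeddings: (n_docs, 768) array of SBERT document vectors
# from the embedding step above.
reducer = umap.UMAP(n_neighbors=15, n_components=5,
                    metric="cosine", random_state=42)
reduced_embeddings = reducer.fit_transform(doc_embeddings)  # (n_docs, 5)
```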

Document Clustering: Finally, we apply the [HDBSCAN](https://www.theoj.org/joss-papers/joss.00205/10.21105.joss.00205.pdf) (hierarchical density-based clustering) algorithm to extract clusters of semantically similar documents. It is an extension of DBSCAN that finds clusters of varying densities by converting DBSCAN into a hierarchical clustering algorithm. HDBSCAN models clusters using a soft-clustering approach, allowing noise to be modeled as outliers. This prevents unrelated documents from being assigned to any cluster and is expected to improve topic representations.
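
A sketch of the clustering step with the hdbscan package; min_cluster_size and the distance metric are illustrative assumptions:

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric="euclidean")
labels = clusterer.fit_predict(reduced_embeddings)
# Documents labeled -1 are noise/outliers and are not forced into any cluster.
# clusterer.condensed_tree_ exposes the cluster hierarchy used downstream.
```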

Topic Representation: The topic representations are modeled based on the documents in each cluster, where each cluster is assigned more than one global and local topic. Using the HDBSCAN algorithm, we can access the hierarchical structure of the documents in each cluster; that is, within each cluster the documents are organized in a parent/child hierarchy. Therefore, for each cluster we can extract global and local topics by applying the [LDA](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) (Latent Dirichlet Allocation) model to those documents. Thus, we have two LDA models per cluster, responsible for generating global and local topics from the parent and child documents respectively.
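
A sketch of fitting one LDA model on the documents of a cluster, using scikit-learn; parent_docs, child_docs, and all parameter values are placeholders for illustration, not the project's exact configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def extract_topics(cluster_docs, n_topics=5, n_words=10):
    # Bag-of-words representation of the documents in one cluster.
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
    dtm = vectorizer.fit_transform(cluster_docs)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=42).fit(dtm)
    vocab = vectorizer.get_feature_names_out()
    # Return the top words for each topic.
    return [[vocab[i] for i in topic.argsort()[-n_words:][::-1]]
            for topic in lda.components_]

# Two models per cluster: parent documents yield global topics,
# child documents yield local topics.
global_topics = extract_topics(parent_docs)
local_topics = extract_topics(child_docs)
```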

(back to top)

Named Entity Recognition

  • NER is a widely used NLP technique that recognizes entities contained in a piece of text, commonly things like people, organizations, locations, etc. This project also includes an NER model implemented with BERT, using the Hugging Face Transformers PyTorch library to quickly and efficiently fine-tune BERT for state-of-the-art performance on Named Entity Recognition. The transformers package provides a BertForTokenClassification class for token-level predictions. BertForTokenClassification is a fine-tuning model that wraps BertModel and adds a token-level classifier on top of it. The token-level classifier is a linear layer that takes the last hidden state of the sequence as input (a minimal sketch follows the example below).

  • Below is an example of the input and output of our named entity model, served with a Streamlit app.

[Example input and output of the NER model in the Streamlit app]
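
A minimal sketch of token-level prediction with BertForTokenClassification; the checkpoint name and label list are illustrative assumptions, and in practice a fine-tuned checkpoint would be loaded rather than the base model:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Label set mirroring the GMB entity tags in IOB format (an assumption).
LABELS = ["O",
          "B-geo", "I-geo", "B-org", "I-org", "B-per", "I-per",
          "B-gpe", "I-gpe", "B-tim", "I-tim", "B-art", "I-art",
          "B-eve", "I-eve", "B-nat", "I-nat"]

# In practice this would point at the fine-tuned checkpoint, not the base model.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(LABELS))
model.eval()

sentence = "GLG connects experts across London and New York."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(f"{token}\t{LABELS[pred]}")
```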

(back to top)

Getting Started

Prerequisites

  1. Install Git LFS (for an installation guide, see the tutorial)
  2. Install Docker (for an installation guide, see the tutorial)
  3. Install Docker Compose (for an installation guide, see the tutorial)

Installation with Docker Compose

To package the whole solution, which uses multiple images/containers, I used Docker Compose. Please follow the steps below for a successful installation.

  1. Clone the repo
    git lfs clone https://github.com/kedir/GLG--Topic-Modeling-and-Document-Clustering.git
  2. Go to the project directory
    cd GLG--Topic-Modeling-and-Document-Clustering
  3. Create a bridge network. Since we have multiple containers communicating with each other, I created a bridge network called AIservice. Create the network by running this command:
    docker network create AIservice
  4. Run the whole application by executing this command:
    docker-compose up -d --build

(back to top)

Usage

Frontend app with Streamlit

You can view the frontend app in the browser at http://localhost:8501/. If you are launching the app in the cloud, replace localhost with your public IP address.

[Streamlit frontend app]

For more examples, please refer to the Documentation.

(back to top)

Support

Contributions, issues, and feature requests are welcome!

Give a ⭐️ if you like this project!

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Contact

(back to top)

References

(back to top)
