
Text Analysis with 20 Newsgroups Dataset

This repository contains code and resources for conducting text analysis on the 20 Newsgroups dataset. The analysis includes text preprocessing, word embeddings, text classification, and text clustering tasks. Below is an overview of the contents and steps involved in this project.

Dataset

The 20 Newsgroups dataset can be accessed here. It contains a collection of approximately 20,000 newsgroup documents, partitioned roughly evenly across 20 different newsgroups.
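
If you prefer to load the data programmatically rather than downloading it manually, scikit-learn ships the same corpus. A minimal sketch (assuming scikit-learn is installed; the `remove` choices are one reasonable option, not necessarily what this repository uses):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the training and test splits; stripping headers/footers/quotes keeps
# models from learning metadata instead of the message bodies.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

print(len(train.data), "training documents")   # ~11,314
print(len(train.target_names), "categories")   # 20
```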

Preprocessing

  1. Noise Removal: Remove unwanted elements such as digits, special characters, and other irrelevant text.
  2. Lowercasing: Convert all text to lowercase to reduce vocabulary sparsity.
  3. Normalization:
    • Stop-word Removal: Eliminate common words that carry little meaning on their own.
    • Lemmatization: Reduce words to their base form for meaningful analysis.
    • Spelling Correction: Correct spelling errors in the text (a combined code sketch of these steps follows this list).
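
A minimal sketch of this pipeline, assuming NLTK for stop words and lemmatization and TextBlob for spelling correction (both library choices are assumptions; the repository may use different tools):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str, correct_spelling: bool = False) -> str:
    # 1. Noise removal: drop digits and special characters.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # 2. Lowercasing.
    text = text.lower()
    # 3a. Optional spelling correction (slow on large corpora).
    if correct_spelling:
        text = str(TextBlob(text).correct())
    # 3b. Stop-word removal and 3c. lemmatization.
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("The 2 cats were running faster than expected!!"))
# -> "cat running faster expected" (verbs keep their form unless POS tags are supplied)
```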

Word Embeddings

Implement and compare different word embedding techniques (a short code sketch follows this list):

  • Word2Vec: Utilizes Continuous Bag of Words (CBOW) and Skip-gram models.
  • GloVe: Generates word vectors based on global word co-occurrence statistics.
  • OpenAI Embeddings: Use pretrained embeddings from OpenAI models.
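
A brief sketch of training Word2Vec in both CBOW and skip-gram modes with gensim, and of loading pretrained GloVe vectors through gensim's downloader. The toy corpus and vector sizes are placeholders, and the OpenAI embeddings call is omitted since it requires an API key:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Tokenised corpus: a list of token lists, e.g. the output of the preprocessing step.
corpus = [["text", "classification", "with", "newsgroups"],
          ["word", "embeddings", "capture", "context"]]

# Word2Vec: sg=0 trains CBOW, sg=1 trains skip-gram.
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["text"].shape)                      # (100,)
print(skipgram.wv.most_similar("text", topn=3))

# GloVe: pretrained vectors built from global word co-occurrence statistics.
glove = api.load("glove-wiki-gigaword-100")       # downloads ~128 MB on first use
print(glove.most_similar("computer", topn=3))
```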

Text Classification

Implement Multinomial Naive Bayes for text classification (a code sketch follows this list). Steps include:

  • Feature Extraction: Convert text data into numerical features using word embeddings.
  • Model Training: Train a Multinomial Naive Bayes classifier.
  • Hyperparameter Tuning: Use RandomizedSearchCV to find the best set of hyperparameters.
  • Evaluation: Evaluate the model using classification metrics like precision, recall, and F1-score.
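 
A hedged sketch of this pipeline with scikit-learn. Since MultinomialNB expects non-negative features, the sketch uses TF-IDF features rather than dense word embeddings, and the hyperparameter grid is illustrative rather than the repository's actual search space:

```python
from scipy.stats import uniform
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),   # feature extraction
    ("nb", MultinomialNB()),                             # classifier
])

# Randomised search over the NB smoothing parameter and TF-IDF settings.
param_distributions = {
    "nb__alpha": uniform(0.01, 1.0),
    "tfidf__min_df": [1, 2, 5],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}
search = RandomizedSearchCV(pipeline, param_distributions,
                            n_iter=10, cv=3, n_jobs=-1, random_state=42)
search.fit(train.data, train.target)

print("Best params:", search.best_params_)
# Precision, recall, and F1-score per class on the held-out test split.
print(classification_report(test.target, search.predict(test.data),
                            target_names=test.target_names))
```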

Text Clustering

Implement KNeighborsClassifier for text clustering (a code sketch follows this list). Additional technique used:

  • K-fold Cross Validation: Evaluate model performance using K-fold cross-validation.
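
A short sketch of evaluating a KNeighborsClassifier with K-fold cross-validation on TF-IDF features. The vectoriser settings and the cosine metric are assumptions, not necessarily the repository's configuration:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Cosine distance tends to behave better than Euclidean in sparse TF-IDF space.
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="cosine")),
])

# 5-fold cross-validation: each document is held out exactly once.
scores = cross_val_score(pipeline, data.data, data.target,
                         cv=KFold(n_splits=5, shuffle=True, random_state=42))
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```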

Challenges and Improvements

Address challenges such as unbalanced data, slang and abbreviations, and word-sense disambiguation, and suggest possible improvements.
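
For the class-imbalance point specifically, a quick first step is to inspect the label counts and use stratified splits so every fold preserves the class distribution. A small illustrative sketch (not the repository's code):

```python
from collections import Counter

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import StratifiedKFold

data = fetch_20newsgroups(subset="train")
counts = Counter(data.target_names[label] for label in data.target)

# Compare the smallest and largest classes; the same check applies to any dataset.
for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {n}")

# Stratified folds keep the class proportions identical in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```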

Contributing

If you find any issues or have suggestions for improvements, feel free to get in touch.

Happy analyzing! 📊📚
