
Text Analysis with 20 Newsgroups Dataset

This repository contains code and resources for conducting text analysis on the 20 Newsgroups dataset. The analysis includes text preprocessing, word embeddings, text classification, and text clustering tasks. Below is an overview of the contents and steps involved in this project.

Dataset

The 20 Newsgroups dataset can be accessed here. It contains a collection of approximately 20,000 newsgroup documents, partitioned roughly evenly across 20 different newsgroups.
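
If you prefer to load the data programmatically rather than downloading it manually, scikit-learn ships the same corpus. A minimal sketch (assuming scikit-learn is installed; the `remove` choices are one reasonable option, not necessarily what this repository uses):

```python
from sklearn.datasets import fetch_20newsgroups

# Fetch the training and test splits; stripping headers/footers/quotes keeps
# models from learning metadata instead of the message bodies.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

print(len(train.data), "training documents")   # ~11,314
print(len(train.target_names), "categories")   # 20
```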

Preprocessing

  1. Noise Removal: Remove unwanted elements such as digits, special characters, and other irrelevant text.
  2. Lowercasing: Convert all text to lowercase to reduce vocabulary sparsity.
  3. Normalization:
    • Stop-word Removal: Eliminate common words that carry little meaning on their own.
    • Lemmatization: Reduce words to their base form for meaningful analysis.
    • Spelling Correction: Correct spelling errors in the text (a combined code sketch of these steps follows this list).
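
A minimal sketch of this pipeline, assuming NLTK for stop words and lemmatization and TextBlob for spelling correction (both library choices are assumptions; the repository may use different tools):

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str, correct_spelling: bool = False) -> str:
    # 1. Noise removal: drop digits and special characters.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # 2. Lowercasing.
    text = text.lower()
    # 3a. Optional spelling correction (slow on large corpora).
    if correct_spelling:
        text = str(TextBlob(text).correct())
    # 3b. Stop-word removal and 3c. lemmatization.
    tokens = [lemmatizer.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("The 2 cats were running faster than expected!!"))
# -> "cat running faster expected" (verbs keep their form unless POS tags are supplied)
```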

Word Embeddings

Implement and compare different word embedding techniques (a short code sketch follows this list):

  • Word2Vec: Utilizes Continuous Bag of Words (CBOW) and Skip-gram models.
  • GloVe: Generates word vectors based on global word co-occurrence statistics.
  • OpenAI Embeddings: Use pretrained embeddings from OpenAI models.
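
A brief sketch of training Word2Vec in both CBOW and skip-gram modes with gensim, and of loading pretrained GloVe vectors through gensim's downloader. The toy corpus and vector sizes are placeholders, and the OpenAI embeddings call is omitted since it requires an API key:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Tokenised corpus: a list of token lists, e.g. the output of the preprocessing step.
corpus = [["text", "classification", "with", "newsgroups"],
          ["word", "embeddings", "capture", "context"]]

# Word2Vec: sg=0 trains CBOW, sg=1 trains skip-gram.
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

print(cbow.wv["text"].shape)                      # (100,)
print(skipgram.wv.most_similar("text", topn=3))

# GloVe: pretrained vectors built from global word co-occurrence statistics.
glove = api.load("glove-wiki-gigaword-100")       # downloads ~128 MB on first use
print(glove.most_similar("computer", topn=3))
```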

Text Classification

Implement Multinomial Naive Bayes for text classification (a code sketch follows this list). Steps include:

  • Feature Extraction: Convert text data into numerical features using word embeddings.
  • Model Training: Train a Multinomial Naive Bayes classifier.
  • Hyperparameter Tuning: Use RandomizedSearchCV to find the best set of hyperparameters.
  • Evaluation: Evaluate the model using classification metrics like precision, recall, and F1-score.
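 
A hedged sketch of this pipeline with scikit-learn. Since MultinomialNB expects non-negative features, the sketch uses TF-IDF features rather than dense word embeddings, and the hyperparameter grid is illustrative rather than the repository's actual search space:

```python
from scipy.stats import uniform
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),   # feature extraction
    ("nb", MultinomialNB()),                             # classifier
])

# Randomised search over the NB smoothing parameter and TF-IDF settings.
param_distributions = {
    "nb__alpha": uniform(0.01, 1.0),
    "tfidf__min_df": [1, 2, 5],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}
search = RandomizedSearchCV(pipeline, param_distributions,
                            n_iter=10, cv=3, n_jobs=-1, random_state=42)
search.fit(train.data, train.target)

print("Best params:", search.best_params_)
# Precision, recall, and F1-score per class on the held-out test split.
print(classification_report(test.target, search.predict(test.data),
                            target_names=test.target_names))
```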

Text Clustering

Implement KNeighborsClassifier for text clustering (a code sketch follows this list). Additional technique used:

  • K-fold Cross Validation: Evaluate model performance using K-fold cross-validation.
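
A short sketch of evaluating a KNeighborsClassifier with K-fold cross-validation on TF-IDF features. The vectoriser settings and the cosine metric are assumptions, not necessarily the repository's configuration:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Cosine distance tends to behave better than Euclidean in sparse TF-IDF space.
    ("knn", KNeighborsClassifier(n_neighbors=5, metric="cosine")),
])

# 5-fold cross-validation: each document is held out exactly once.
scores = cross_val_score(pipeline, data.data, data.target,
                         cv=KFold(n_splits=5, shuffle=True, random_state=42))
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```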

Challenges and Improvements

Address challenges such as unbalanced data, slang and abbreviations, and word-sense disambiguation, and suggest possible improvements.
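
For the class-imbalance point specifically, a quick first step is to inspect the label counts and use stratified splits so every fold preserves the class distribution. A small illustrative sketch (not the repository's code):

```python
from collections import Counter

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import StratifiedKFold

data = fetch_20newsgroups(subset="train")
counts = Counter(data.target_names[label] for label in data.target)

# Compare the smallest and largest classes; the same check applies to any dataset.
for name, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {n}")

# Stratified folds keep the class proportions identical in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```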

Contributing

If you find any issues or have suggestions for improvements, feel free to get in touch.

Happy analyzing! 📊📚
