
Testing the best algorithm for "Text Classification"

This experiment compares several algorithms to find the best one for classifying text with Python.

Dataset

The dataset used for this experiment is the "Twenty Newsgroups" dataset. It is stored in the folder 'dataset/' inside the root folder. For this experiment, only 6 of the 20 newsgroups are used:

`space`, `graphics`, `windows`, `religion`, `motorcycles` and `forsale`

UTF-8 incompatibility

Some of the supplied text files were not valid UTF-8, so they are deleted as part of preprocessing.
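The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from main.py or util.py: the helper names `is_valid_utf8` and `find_non_utf8_files` are hypothetical, and the sketch only lists the offending files, leaving the actual deletion to the caller.

```python
import os


def is_valid_utf8(data):
    """Return True if the given bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_files(root):
    """Walk `root` and collect paths of files that are not valid UTF-8.

    Hypothetical helper: the original project may detect these files
    differently.
    """
    bad_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Read raw bytes so that decoding errors surface explicitly.
            with open(path, "rb") as handle:
                if not is_valid_utf8(handle.read()):
                    bad_paths.append(path)
    return bad_paths
```

Calling `find_non_utf8_files("dataset/")` would return the candidate files, which could then be reviewed before being removed with `os.remove`.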

Requirements

python 2.7
python modules:
- scikit-learn
- colorama
- termcolor

Running the code

python2.7 main.py

Experiments

For the experiments, we assume that we like the graphics, space and religion newsgroups and dislike the windows, motorcycles and forsale newsgroups. This turns the task into a binary classification between "likes" and "dislikes".
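The grouping above amounts to a mapping from newsgroup name to one of two labels. A minimal sketch, assuming the short newsgroup names listed in the Dataset section are used as keys (the function name `preference_label` is hypothetical and not taken from the project code):

```python
# Newsgroups assumed to be liked and disliked, as stated in the README.
LIKED = frozenset(["graphics", "space", "religion"])
DISLIKED = frozenset(["windows", "motorcycles", "forsale"])


def preference_label(newsgroup):
    """Map a newsgroup name to the binary label used in the reports."""
    if newsgroup in LIKED:
        return "likes"
    if newsgroup in DISLIKED:
        return "dislikes"
    # Any other newsgroup is outside the six chosen for this experiment.
    raise ValueError("unknown newsgroup: {0}".format(newsgroup))
```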

With 20% of the data held out as the test set (1104 documents), we ran three different experiments:
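The overall setup can be sketched with scikit-learn as below. This is an illustration under several assumptions, not the project's main.py: it targets a current scikit-learn on Python 3 (older releases expose `train_test_split` from a different module), the specific estimators `MultinomialNB`, `LinearSVC` and `KNeighborsClassifier` are guesses at what "Naive Bayes", "Support Vector Machine" and "K-Nearest Neighbours" refer to, and the tiny corpus is made up so that the block runs without the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up stand-in corpus (illustrative only, not the newsgroup data).
liked_topics = ["orbit rocket launch", "pixel shader rendering",
                "church faith scripture"]
disliked_topics = ["windows driver install", "motorcycle helmet ride",
                   "sale price shipping offer"]
texts, labels = [], []
for i in range(4):
    for topic in liked_topics:
        texts.append("{0} post {1}".format(topic, i))
        labels.append("likes")
    for topic in disliked_topics:
        texts.append("{0} post {1}".format(topic, i))
        labels.append("dislikes")


def evaluate(texts, labels, test_size=0.2, seed=0):
    """Fit TF-IDF + classifier pipelines and return macro F1 per model."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed,
        stratify=labels)
    # Assumed estimator choices; the original code may use other variants.
    models = {
        "naive_bayes": MultinomialNB(),
        "svm": LinearSVC(),
        "knn": KNeighborsClassifier(),
    }
    scores = {}
    for name, classifier in models.items():
        pipeline = make_pipeline(TfidfVectorizer(), classifier)
        pipeline.fit(x_train, y_train)
        predictions = pipeline.predict(x_test)
        scores[name] = f1_score(y_test, predictions, average="macro")
    return scores


scores = evaluate(texts, labels)
```

Passing a fixed `seed` keeps the train/test split identical across the three models, which makes their scores directly comparable.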

TFIDF with Naive Bayes

Results:

             precision    recall  f1-score   support

   dislikes       0.94      0.97      0.96       574
      likes       0.97      0.93      0.95       530

avg / total       0.96      0.95      0.95      1104

TFIDF with Support Vector Machine

Results:

             precision    recall  f1-score   support

   dislikes       0.97      0.98      0.97       594
      likes       0.98      0.96      0.97       510

avg / total       0.97      0.97      0.97      1104
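As a quick check on how the columns relate, the f1-score is the harmonic mean of precision and recall, and in the scikit-learn releases that print an "avg / total" row, that row is the support-weighted average of the per-class values. A small sketch using the SVM figures above (the helper names are illustrative; because the inputs are already rounded to two decimals, recomputed values in other tables can differ in the last digit):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values weighted by class support."""
    total = float(sum(supports))
    return sum(v * s for v, s in zip(values, supports)) / total


# Per-class figures copied from the SVM report above.
f1_dislikes = f1_from(0.97, 0.98)
f1_likes = f1_from(0.98, 0.96)
f1_average = weighted_average([f1_dislikes, f1_likes], [594, 510])

print(round(f1_dislikes, 2), round(f1_likes, 2), round(f1_average, 2))
# prints: 0.97 0.97 0.97
```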

TFIDF with K-Nearest Neighbours

Results:

             precision    recall  f1-score   support

   dislikes       0.95      0.93      0.94       558
      likes       0.93      0.95      0.94       546

avg / total       0.94      0.94      0.94      1104

Conclusion

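These experiments show that TFIDF with a Support Vector Machine (SVM) yielded the best results of the three algorithms, with an average f1-score of 0.97, compared with 0.95 for Naive Bayes and 0.94 for K-Nearest Neighbours.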

manishrsilwal/text_classification_ML
