This is an experiment in choosing the best algorithm for classifying text with Python.
The dataset used for this experiment is the "Twenty Newsgroups" dataset, stored in the 'dataset/' folder inside the root folder. Only 6 of the 20 newsgroups are used:
`space`, `graphics`, `windows`, `religion`, `motorcycles` and `forsale`
Some of the supplied text files were not valid UTF-8, so they were deleted as part of preprocessing.
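The UTF-8 filtering step could be sketched roughly as follows. This is a minimal illustration and not the code from `main.py`: the function names `is_valid_utf8` and `remove_non_utf8_files` are hypothetical, and the sketch assumes every file under the dataset folder is a plain-text post.

```python
import os


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def remove_non_utf8_files(root):
    """Delete every file under `root` that is not valid UTF-8.

    Returns the list of deleted paths so the caller can log them.
    """
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not is_valid_utf8(path):
                os.remove(path)
                removed.append(path)
    return removed
```

Calling `remove_non_utf8_files("dataset/")` once before training would apply it to the whole dataset folder. Note that it deletes files in place, so it is worth keeping a copy of the original data.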
Requirements:
- Python 2.7
- Python modules:
  - scikit-learn
  - colorama
  - termcolor

Run the experiments with `python2.7 main.py`.
For the experiments, we assume that we like the `graphics`, `space` and `religion` newsgroups, and that we dislike the `windows`, `motorcycles` and `forsale` newsgroups.
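This assumption turns the six-way newsgroup problem into a binary one. A minimal sketch of that mapping is shown here; the names `LIKED`, `DISLIKED` and `preference_label` are hypothetical and may not match what `main.py` uses, while the `likes` and `dislikes` labels are the ones that appear in the results.

```python
# Newsgroups treated as "likes" and "dislikes", as stated in the text.
LIKED = {"graphics", "space", "religion"}
DISLIKED = {"windows", "motorcycles", "forsale"}


def preference_label(newsgroup):
    """Map a newsgroup name to the binary target used for classification."""
    if newsgroup in LIKED:
        return "likes"
    if newsgroup in DISLIKED:
        return "dislikes"
    raise ValueError("unknown newsgroup: %r" % newsgroup)
```

For example, `preference_label("space")` gives `"likes"` and `preference_label("forsale")` gives `"dislikes"`. Raising on an unknown name makes a misspelled folder fail loudly instead of being silently mislabeled.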
For a test size of 20%, we have three different experiments:
Results:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dislikes | 0.94 | 0.97 | 0.96 | 574 |
| likes | 0.97 | 0.93 | 0.95 | 530 |
| avg / total | 0.96 | 0.95 | 0.95 | 1104 |
Results:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dislikes | 0.97 | 0.98 | 0.97 | 594 |
| likes | 0.98 | 0.96 | 0.97 | 510 |
| avg / total | 0.97 | 0.97 | 0.97 | 1104 |
Results:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| dislikes | 0.95 | 0.93 | 0.94 | 558 |
| likes | 0.93 | 0.95 | 0.94 | 546 |
| avg / total | 0.94 | 0.94 | 0.94 | 1104 |
These experiments show that TF-IDF with a Support Vector Machine (SVM) yielded better results than the other algorithms.
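For reference, a TF-IDF plus SVM setup with a 20% test split might look like the sketch below in scikit-learn. Several things here are assumptions and not taken from `main.py`: the toy corpus stands in for the real posts, `LinearSVC` is only one possible SVM implementation, and the `random_state` and `stratify` settings are chosen for a reproducible example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical); the real experiment reads posts from dataset/.
texts = [
    "The shuttle reached orbit after launch from the space center",
    "New images of Mars from the space probe in orbit",
    "Rendering a space station in orbit with ray tracing",
    "The telescope in orbit sends pictures of deep space",
    "Faith, the vastness of space and the orbit of planets",
    "Motorcycle helmet for sale, good price",
    "Windows software bundle on sale at a low price",
    "Used bike parts for sale, price negotiable",
    "Monitor for sale, price includes shipping",
    "Old drivers disk for sale, make an offer on the price",
]
labels = ["likes"] * 5 + ["dislikes"] * 5

# Hold out 20% of the documents for testing, keeping both classes represented.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# TF-IDF features feeding a linear SVM (assumed variant of "SVM").
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)

# Produces a precision / recall / f1-score / support report like the ones above.
report = classification_report(y_test, model.predict(X_test))
print(report)
```

Wrapping the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary fitted on the training split only, so no information from the test documents leaks into the features.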