Text Classification

Classify BBC articles into categories with scikit-learn

How It Works

There are two data sets: train_set.csv with 12,267 data points and test_set.csv with 3,068 data points. The train set contains five columns per article: ID, Title, Content, Category (Politics, Film, Football, Business, Technology) and RowNum. The goal is to find the best classifier for this train set and then use it to classify the articles of the test set.

First, you can get a feel for the data set by running the wordCloud.py module, which generates one word cloud per category.

The next step is to preprocess the content of each article and convert it into a vector representation, excluding stop-words and using the TF-IDF vectorizer. Each vector is then projected to a lower dimension. This reduces the training time of each model and can even increase accuracy, since irrelevant and redundant information may be removed during this step. The best trade-off between accuracy and training time is at 100 dimensions (see lsi_plot.png).

After that, 10-fold cross-validation is used to train different models on the train set and to find their best hyper-parameters (the grid_search_*.py modules). By running the train_models.py module, you can then see which model performs best on this problem, using the train set.

Once the best model is found, some extra preprocessing steps are added, such as Porter stemming and appending the title to each TF-IDF vector. You can then run the beat_the_benchmark.py module to predict the categories of the test articles.
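The steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration, not the code from the repository: it assumes the dimensionality reduction is LSI done with TruncatedSVD, it picks SVC as the classifier only as an example, and the toy sentences, the step names and the parameter grid are all made up. The toy corpus is tiny, so the sketch uses 2 components and 3 folds where the text above uses 100 dimensions and 10 folds.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the Content and Category columns (made-up text).
football = [
    "The striker scored a late goal to win the match",
    "The team lost the league match after a penalty goal",
    "The goalkeeper saved a penalty and the team won the cup",
    "The coach praised the striker after the league match",
    "Fans cheered as the team scored another goal",
    "The club signed a new striker before the cup match",
]
business = [
    "Shares fell as the bank reported lower profit",
    "The company raised profit forecasts and shares rose",
    "The market rallied after the bank cut interest rates",
    "Investors sold shares as the market slumped",
    "The company reported record profit this quarter",
    "The bank warned investors about market risk",
]
documents = football + business
labels = ["Football"] * len(football) + ["Business"] * len(business)

pipeline = Pipeline([
    # TF-IDF vectors with English stop-words excluded.
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Assumed LSI step: truncated SVD on the TF-IDF matrix (2 components here).
    ("lsi", TruncatedSVD(n_components=2, random_state=0)),
    # SVC is an assumed classifier choice for this sketch.
    ("clf", SVC()),
])

# Hypothetical hyper-parameter grid; the real grids live in the repository.
param_grid = {"clf__C": [1, 10], "clf__kernel": ["linear", "rbf"]}

# 3 stratified folds fit the toy corpus; the real train set allows 10.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=folds, scoring="accuracy")
search.fit(documents, labels)

print(search.best_params_)

# The refitted best pipeline classifies unseen text directly.
predictions = search.predict([
    "The striker scored a goal",
    "The bank reported profit",
])
print(list(predictions))
```

Keeping the vectorizer and the reduction step inside the pipeline means both are refitted on the training part of every fold, so the cross-validated score is not influenced by the held-out articles.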

Note: the knn.py module is a custom implementation of the k-nearest neighbours algorithm.
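For readers unfamiliar with the algorithm, here is a minimal sketch of k-nearest neighbours with Euclidean distance and majority voting. It is not the code in knn.py: the function names, the distance metric and the toy vectors are assumptions made for illustration only.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to query."""
    # Sort training points by distance to the query and keep the k closest.
    neighbours = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy 2-dimensional vectors standing in for reduced article vectors.
train_vectors = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (1.0, 0.9), (0.9, 1.1), (1.1, 1.0),
]
train_labels = ["Football"] * 3 + ["Business"] * 3

print(knn_predict(train_vectors, train_labels, (0.1, 0.1), k=3))  # Football
print(knn_predict(train_vectors, train_labels, (1.0, 1.0), k=3))  # Business
```

An odd k with two classes avoids tied votes. With more categories, a tie-breaking rule (for example, preferring the nearest neighbour's label) would need to be chosen explicitly.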

Requirements

Python 2.7
Scikit-learn
Pandas
NLTK
matplotlib
wordcloud

Helpful Links:

Authors

Petropoulakis Panagiotis petropoulakispanagiotis@gmail.com
Andreas Charalambous and.charalampous@gmail.com

