Context: Course assignment

This is my solution for the coding assignment in the Supervised Learning course. We were asked to find a suitable NLP dataset and then apply appropriate ML algorithms. I chose a fake news classification dataset.

What was the main goal? To correctly classify news as unreliable (fake) or reliable.

What were the tasks/steps to accomplish it?

Preprocess the data.
Choose a suitable machine learning classifier.
Justify and explain the output.

Data: News

The dataset, called 'Fake News', is available on Kaggle. It contains a mix of unreliable and reliable news.

Metadata:

id: unique id for a news article
title: the title of a news article
author: author of the news article
text: the text of the article; could be incomplete
label: a label that marks the article as potentially unreliable
- 1: unreliable
- 0: reliable

Process and results

Comments on the process are directly in the code. Here is a quick overview:

Data load and simple Exploratory Data Analysis (EDA).
Data preprocessing phase, which included imputation of null values and standard NLP preprocessing steps (removing stop words and multi-spaces, lowercasing, tokenization and lemmatization).
Vectorization using Bag of Words and TF-IDF methods.
Using Naive Bayes and Logistic Regression for classification.
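
The preprocessing step can be sketched roughly as below. This is a minimal, standard-library-only illustration, not the notebook's actual code: the `preprocess` function, the tiny `STOP_WORDS` set and the regex tokenizer are all assumptions made for the example, and lemmatization is left out because it needs an external resource (for instance NLTK or spaCy).

```python
import re

# Tiny illustrative stop-word list (an assumption; the notebook may use a
# library-provided list instead).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "and", "to"}


def preprocess(text):
    """Return a list of cleaned tokens for one document (hypothetical helper)."""
    if text is None:
        # Impute a missing value with an empty string.
        text = ""
    text = text.lower()                       # lowercasing
    text = re.sub(r"\s+", " ", text).strip()  # collapse multi-spaces
    tokens = re.findall(r"[a-z0-9]+", text)   # simple regex tokenization
    # Stop-word removal. Lemmatization is intentionally omitted in this sketch.
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The  Senate   is voting on the NEW bill"))
print(preprocess(None))
```

Here `None` stands in for a null value; with pandas, nulls would typically be filled first (for example with an empty string) before applying such a function.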

Logistic Regression outperformed Naive Bayes in this task, with higher accuracy, precision and recall. The accuracy gap ranged from 0.04 to 0.11, depending on the alpha value of the Naive Bayes model.
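
A comparison of this kind can be sketched with scikit-learn as follows. The toy corpus, the alpha grid and the choice of `TfidfVectorizer` are assumptions for illustration only; they are not the data or settings used in the notebook, and the scores on six made-up sentences say nothing about the real results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus (1 = unreliable, 0 = reliable).
train_texts = [
    "shocking secret cure doctors hate",
    "you will not believe this miracle trick",
    "aliens secretly control the government",
    "parliament passes annual budget bill",
    "central bank keeps interest rates unchanged",
    "city council approves new bus routes",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["miracle cure secret revealed", "council passes budget for bus routes"]
test_labels = [1, 0]

# Fit the vectorizer on the training texts only, then reuse it for the test texts.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

results = {}
# Sweep the Naive Bayes smoothing parameter alpha (grid chosen arbitrarily).
for alpha in (0.01, 0.1, 1.0):
    nb = MultinomialNB(alpha=alpha).fit(X_train, train_labels)
    results[f"NB(alpha={alpha})"] = accuracy_score(test_labels, nb.predict(X_test))

lr = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
results["LogisticRegression"] = accuracy_score(test_labels, lr.predict(X_test))

for name, acc in results.items():
    print(f"{name}: accuracy={acc:.2f}")
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give the Bag of Words variant of the same comparison.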

What looks suspicious is that the Logistic Regression model has exactly the same value for all evaluation metrics. Although my work was approved and no significant mistakes or issues were identified, I was advised to inspect this unusual occurrence.
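
One possible cause worth checking (an assumption, not verified against the notebook) is the averaging mode of the metrics. For single-label classification, micro-averaged precision and recall are always equal to accuracy, so calling scikit-learn's `precision_score` and `recall_score` with `average="micro"` would produce three identical numbers. The labels below are made up to show the effect:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 2 true positives, 2 false negatives, 1 false positive, 5 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)

# Micro averaging pools the counts of both classes, which collapses to accuracy.
micro_p = precision_score(y_true, y_pred, average="micro")
micro_r = recall_score(y_true, y_pred, average="micro")

# Binary averaging reports the metrics for the positive class (label 1) only.
bin_p = precision_score(y_true, y_pred, average="binary")
bin_r = recall_score(y_true, y_pred, average="binary")

print(f"accuracy={acc:.3f} micro_precision={micro_p:.3f} micro_recall={micro_r:.3f}")
print(f"binary_precision={bin_p:.3f} binary_recall={bin_r:.3f}")
```

If the notebook already uses binary averaging, this explanation does not apply and the cause has to be looked for elsewhere, for example in the confusion matrix.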

Thus, I drafted this to-do list for future work:

Build and visualize a confusion matrix.
Try cross-validation.
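
Both to-do items could start from something like the sketch below. It is only an outline under stated assumptions: the twelve sentences are made up, and the pipeline (TF-IDF plus Logistic Regression) and `cv=3` are illustrative choices rather than the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (1 = unreliable, 0 = reliable).
texts = [
    "shocking secret cure doctors hate",
    "you will not believe this miracle trick",
    "aliens secretly control the government",
    "miracle pill melts fat overnight",
    "secret plot exposed by anonymous insider",
    "shocking truth they do not want you to know",
    "parliament passes annual budget bill",
    "central bank keeps interest rates unchanged",
    "city council approves new bus routes",
    "court publishes ruling on tax dispute",
    "university opens new research library",
    "ministry reports quarterly trade figures",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keeping the vectorizer inside the pipeline means it is re-fitted on the
# training part of every fold, so no vocabulary leaks from the held-out part.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Accuracy per fold (stratified 3-fold split for a classifier).
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")

# Out-of-fold predictions, used to build one confusion matrix over all samples.
y_pred = cross_val_predict(model, texts, labels, cv=3)
cm = confusion_matrix(labels, y_pred, labels=[0, 1])

print("fold accuracies:", scores)
print("confusion matrix (rows = true 0/1, columns = predicted 0/1):")
print(cm)
```

For the visualization, `sklearn.metrics.ConfusionMatrixDisplay` can plot `cm` when matplotlib is installed.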

FilipKopecky/fake_news_classification
