
Fake-News

Fake news analysis in R: EDA and prediction (Naive Bayes, Logistic Regression, Random Forest, SVM, NNET).

Code:

https://github.com/trajceskijovan/Fake-News/blob/main/Fake%20News.R

Presentation:

https://github.com/trajceskijovan/Fake-News/blob/main/Presentation.pdf

EDA:

  1. Top words by "Title" (only words longer than 5 characters are allowed):

  2. Merge the fake and true news datasets and plot article counts over time:

  • Fake news is more frequent in 2016, 2017, and the first half of 2018.
  • In Q4 2018, fake and true news are balanced.
  3. Are the datasets balanced?

The merged dataset is balanced, which will make prediction easier.

  4. There are 631 missing values in the "Text" column; they are removed.

  5. News counts by subject, and subject-by-category plots (a code sketch follows the list):
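
A minimal sketch of the EDA steps above. The file names (Fake.csv, True.csv), column names (text, date), and date format are assumptions, not taken from the repository:

```r
library(dplyr)
library(ggplot2)

# Load the two datasets (file names are assumptions)
fake <- read.csv("Fake.csv", stringsAsFactors = FALSE)
true <- read.csv("True.csv", stringsAsFactors = FALSE)

# Label each source, then merge
fake$category <- "fake"
true$category <- "true"
news <- bind_rows(fake, true)

# Is the merged dataset balanced?
table(news$category)

# Drop rows with missing or empty text (631 such rows in this data)
news <- news %>% filter(!is.na(text), nchar(trimws(text)) > 0)

# Article counts over time by category (date format is an assumption)
news %>%
  mutate(date = as.Date(date, format = "%B %d, %Y")) %>%
  count(date, category) %>%
  ggplot(aes(date, n, colour = category)) +
  geom_line() +
  labs(title = "Fake vs. true news over time", x = NULL, y = "Articles")
```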

Pre-processing & data cleanup steps are outlined below (a code sketch follows the list):

• Create a corpus (the object type expected by the "tm" library)

• Convert text to lower case

• Remove numbers

• Remove punctuation

• Remove stopwords

• Remove specific words (for example, the newspaper name "Reuters", which appears in every article)

• Remove whitespace

• I decided not to stem words, since stemming truncates words and they may lose their meaning

• Remove remaining punctuation issues (for example, with the "[[:punct:]]" pattern)

• Lemmatization

• Create a Document-Term Matrix with a control list (for example, wordLengths = c(5, 20))

• I enforced lower and upper limits on word length (between 5 and 20 characters) to speed up processing and eliminate noise

• After that, I removed all terms whose sparsity exceeds the threshold (sparse = 0.85); sparsity dropped from 100% to 77% and the maximum term length dropped from 20 to 14

• Convert the DTM to a matrix, then to a data frame
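
The pipeline above maps closely onto tm plus textstem. A sketch under those assumptions; object names such as news and dtm.df are mine, while dtm.clean matches the term-frequency call below:

```r
library(tm)
library(textstem)

# Build a corpus from the text column (the object type tm expects)
corpus <- VCorpus(VectorSource(news$text))

# Cleaning steps, in the order listed above
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("reuters"))  # newspaper name on every article
corpus <- tm_map(corpus, stripWhitespace)

# Remove leftover punctuation via a regex transformer
corpus <- tm_map(corpus, content_transformer(function(x) gsub("[[:punct:]]", " ", x)))

# Lemmatize instead of stemming, so words keep their meaning
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))

# Document-Term Matrix, keeping words of 5 to 20 characters only
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(5, 20)))

# Drop terms sparser than the threshold (sparsity: 100% -> 77%)
dtm.clean <- removeSparseTerms(dtm, sparse = 0.85)

# Matrix -> data frame for modeling
dtm.df <- as.data.frame(as.matrix(dtm.clean))
```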

Post-processing analysis:

Return all terms that occur at least 20,000 times in the entire corpus:

```r
findFreqTerms(dtm.clean, lowfreq = 20000)
```

```
 [1] "american"   "campaign"   "clinton"    "country"    "donald"     "election"   "government" "house"
 [9] "include"    "obama"      "official"   "party"      "people"     "president"  "report"     "republican"
[17] "right"      "state"      "trump"      "unite"      "white"
```

Correlation limit inspection and associations among "Trump", "Obama", "Russia", "State":
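
A sketch of the association lookup using tm's findAssocs; the correlation limit of 0.2 is an assumption (the terms are lower case because the corpus was lower-cased during cleaning):

```r
# For each keyword, return every term whose correlation across documents
# meets the chosen limit (0.2 is an assumption)
findAssocs(dtm.clean, terms = c("trump", "obama", "russia", "state"), corlimit = 0.2)
```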

WordClouds:
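
A minimal word-cloud sketch built from the cleaned DTM, using the wordcloud package; the 100-word cap and colour palette are assumptions:

```r
library(wordcloud)
library(RColorBrewer)

# Term frequencies from the cleaned DTM
freq <- sort(colSums(as.matrix(dtm.clean)), decreasing = TRUE)

# Plot the most frequent terms
wordcloud(words = names(freq), freq = freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```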

Modeling, Prediction, Performance (a code sketch follows the list):

  1. Naive Bayes Model

  2. Logistic Regression Model

  3. Random Forest Model

  4. SVM

  5. NNET

  6. Evaluation: ROC

  7. Evaluation: Confusion Matrix

  8. Evaluation: Summary Table
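
A hedged sketch of one model plus the evaluation steps, using e1071, caret, and pROC. The 70/30 split and the choice of Naive Bayes for illustration are assumptions; the other models plug into the same predict-then-evaluate pattern:

```r
library(e1071)  # naiveBayes
library(caret)  # confusionMatrix (includes accuracy and F1)
library(pROC)   # ROC curve and AUC

# Attach the label and split into train/test (70/30 is an assumption)
dtm.df$category <- factor(news$category)
set.seed(123)
idx   <- sample(nrow(dtm.df), 0.7 * nrow(dtm.df))
train <- dtm.df[idx, ]
test  <- dtm.df[-idx, ]

# Naive Bayes as one example
nb    <- naiveBayes(category ~ ., data = train)
pred  <- predict(nb, test)                       # predicted class labels
probs <- predict(nb, test, type = "raw")[, 2]    # posterior probabilities

# Confusion matrix with accuracy, F1, and related statistics
cm <- confusionMatrix(pred, test$category, mode = "everything")
print(cm)

# ROC curve and AUC
roc.obj <- roc(response = test$category, predictor = probs)
plot(roc.obj)
auc(roc.obj)
```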