Data-Science-Advanced-Capstone

Teacher's GitHub repo for this class: https://github.com/IBM/skillsnetwork/tree/master/coursera_capstone

Some docs:

Real and Fake news Dataset: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset?select=True.csv

Excellent academic paper on previous work in this area: Automatic Detection of Fake News. Veronica Perez-Rosas (1), Bennett Kleinberg (2), Alexandra Lefevre (1), Rada Mihalcea (1). (1) Computer Science and Engineering, University of Michigan; (2) Department of Psychology, University of Amsterdam. vrncapr@umich.edu, b.a.r.kleinberg@uva.nl, mihalcea@umich.edu

Abstract: "The proliferation of misleading information in everyday access media outlets such as social media feeds, news blogs, and online newspapers have made it challenging to identify trustworthy news sources, thus increasing the need for computational tools able to provide insights into the reliability of online content. In this paper, we focus on the automatic identification of fake content in online news. Our contribution is twofold. First, we introduce two novel datasets for the task of fake news detection, covering seven different news domains. We describe the collection, annotation, and validation process in detail and present several exploratory analyses on the identification of linguistic differences in fake and legitimate news content. Second, we conduct a set of learning experiments to build accurate fake news detectors, and show that we can achieve accuracies of up to 76%. In addition, we provide comparative analyses of the automatic and manual identification of fake news."

https://www.aclweb.org/anthology/C18-1287.pdf

Description of ClaimReview, a centralized fact-checking data feed that draws on a number of fact-checking organizations: https://www.storybench.org/how-claimreview-is-simplifying-the-process-of-fact-checking/

Video on using Keras for sentiment analysis of IMDB reviews: https://www.coursera.org/learn/ai/lecture/hQYsN/recurrent-neural-networks

Other articles on classifying real/fake news: https://opendatascience.com/how-to-build-a-fake-news-classification-model/

https://www.kaggle.com/anthonyc1/fake-news-classifier-final-project

FakeNewsNet scripts and description: https://github.com/KaiDMML/FakeNewsNet

FakeNewsNet article using the data above: http://sbp-brims.org/2018/proceedings/papers/challenge_papers/SBP-BRiMS_2018_paper_120.pdf

A paper: “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. William Yang Wang, Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA 93106 USA. william@cs.ucsb.edu

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 422–426, Vancouver, Canada, July 30 - August 4, 2017. © 2017 Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-2067

https://www.aclweb.org/anthology/P17-2067.pdf

Another paper using the same dataset: https://www.aclweb.org/anthology/W18-5513/

The Data Challenge in Misinformation Detection: Source Reputation vs. Content Veracity. Big Data and quality data for fake news and misinformation detection. https://www.aclweb.org/anthology/W18-5502.pdf

Contains a good discussion of the varieties of fake news and various modes of news bias through history: https://journals.sagepub.com/doi/full/10.1177/2053951719843310

A paper discussing the datasets below:

Misinfo dataset: https://github.com/sfu-discourse-lab/MisInfoText and https://github.com/sfu-discourse-lab/Misinformation_detection

Asr FT, Taboada M (2019) | 1,380 news articles | 4-way (false, true, mixture, no factual content) | Collected using a pivot Buzzfeed dataset. Focused on the US election topic. https://github.com/sfu-discourse-lab/Misinformation_detection/blob/master/buzzfeed-v02-originalLabels.txt.zip

These stories were scraped from the web and are skewed towards true stories. The 1,380 stories are labelled as follows: Mixed: 170, Mostly False: 64, Mostly True: 1090, No factual content: 56. They also seem to be poor training material because the scraping process appears to have merged words together, especially words with apostrophes (e.g., "I hope youdon mind my asking, butI curious" was recorded rather than "I hope you don't mind my asking, but I'm curious."). This doesn't seem to be a suitable dataset.
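To make the skew concrete, here is a minimal sketch that turns the label counts quoted above into shares and a majority-class baseline. The variable names are illustrative only; nothing here reads the actual dataset file.

```python
# Label counts as reported above for the 1,380-story Buzzfeed dataset.
label_counts = {
    "Mixed": 170,
    "Mostly False": 64,
    "Mostly True": 1090,
    "No factual content": 56,
}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

# Print labels from most to least frequent with their share of the dataset.
for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<20} {label_counts[label]:>5} {share:6.1%}")

# Accuracy of a trivial classifier that always predicts the most common label.
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total
print(f"Majority-class baseline ({majority_label}): {majority_baseline:.1%}")
```

Any model trained on this data would have to beat the majority-class baseline before its accuracy says anything useful, which is one more reason to treat the dataset with caution.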

Asr FT, Taboada M (2019) | 33 news articles | 4-way (false, true, mixture, no factual content) | Collected using a pivot Buzzfeed dataset. A variety of topics. https://github.com/sfu-discourse-lab/Misinformation_detection/blob/master/buzzfeed-top.csv.zip

This selection from the top 50 Buzzfeed false stories has 34 stories, all of them false, with story text that does not suffer from the adulteration described above.

Asr FT, Taboada M (2019) | 312 news articles | 5-way (false to true) | Collected from Snopes. Balanced by label. A variety of topics. Includes stance information (articles for or against a labeled claim). https://github.com/sfu-discourse-lab/Misinformation_detection/blob/master/snopes_checked_v02.csv.zip

This is a collection of stories, as labelled by Snopes.com, with full text that has not been adulterated by the scraping process. The counts are: Mixed: 72, Mostly False: 53, False: 51, Mostly True: 72, True: 65.

Using a ReLU activation function:

Probably start with it by default and explore from there: https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
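For reference, a minimal pure-Python sketch of the rectified linear activation and its derivative. The function names are illustrative, and setting the derivative to 0 at exactly x == 0 is a common convention assumed here, not something the linked article is quoted on.

```python
def relu(x: float) -> float:
    """Rectified linear unit: pass positive values through, clamp the rest to 0."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU, using 0 at x == 0 by convention (an assumption here)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 0.5, 2.0]
print(grads)    # [0.0, 0.0, 0.0, 1.0, 1.0]
```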

A Simple Way to Initialize Recurrent Networks of Rectified Linear Units (LSTM is a kind of RNN)

https://arxiv.org/abs/1504.00941
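The core idea of that paper is to initialize the recurrent weight matrix of a ReLU RNN to the identity and the biases to zero. The sketch below illustrates why that helps, in pure Python. All names are hypothetical, and the input weights are set to zero purely to isolate the recurrent path, which is an assumption of this example rather than the paper's setup.

```python
def identity_matrix(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_step(h, x, w_hh, w_xh, bias):
    """One step of a plain RNN with ReLU activation."""
    recurrent = matvec(w_hh, h)
    from_input = matvec(w_xh, x)
    return [max(0.0, r + i + b) for r, i, b in zip(recurrent, from_input, bias)]


hidden_size, input_size = 3, 2
w_hh = identity_matrix(hidden_size)   # identity recurrent weights
bias = [0.0] * hidden_size            # zero biases
# Zero input weights: an illustration-only choice so the input has no effect.
w_xh = [[0.0] * input_size for _ in range(hidden_size)]

h0 = [1.0, 0.5, 0.0]
h = h0
for _ in range(10):
    h = rnn_step(h, [0.3, -0.7], w_hh, w_xh, bias)

print(h)  # the non-negative hidden state is carried through unchanged
```

With this initialization and no input contribution, each step copies the previous hidden state, so information (and gradient) is neither amplified nor shrunk as it passes through time.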

Article on how to fix the vanishing gradients problem:

Our initial models may suffer from this:

"The vanishing gradients problem may be manifest in a Multilayer Perceptron by a slow rate of improvement of a model during training and perhaps premature convergence, e.g. continued training does not result in any further improvement. Inspecting the changes to the weights during training, we would see more change (i.e. more learning) occurring in the layers closer to the output layer and less change occurring in the layers close to the input layer."

https://machinelearningmastery.com/how-to-fix-vanishing-gradients-using-the-rectified-linear-activation-function/
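A rough back-of-the-envelope sketch of why the layers near the input learn less. It multiplies the local activation derivative across layers and deliberately ignores the weights (equivalent to assuming every weight is 1), so it is a simplification, not a full backpropagation. The function names are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z == 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU for a given pre-activation."""
    return 1.0 if z > 0 else 0.0


def gradient_scale(activation_grad, pre_activation: float, depth: int) -> float:
    """Factor by which a gradient is scaled after passing back through `depth`
    layers that all share the same pre-activation (weights ignored)."""
    return activation_grad(pre_activation) ** depth


depth = 10
sigmoid_scale = gradient_scale(sigmoid_grad, 0.0, depth)  # best case for sigmoid
relu_scale = gradient_scale(relu_grad, 1.0, depth)        # active ReLU units

print(f"sigmoid, {depth} layers: {sigmoid_scale:.3e}")
print(f"ReLU,    {depth} layers: {relu_scale:.3e}")
```

Even in the sigmoid's best case the gradient shrinks by a factor of 0.25 per layer, while an active ReLU unit passes it through unchanged.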

Feature engineering for various Text classification tasks: https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa

Paper on detecting deception via faint linguistic signals:

"Deception is a complex, pervasive, sometimes high-stakes human activity; while the linguistic implementation of deceptive acts by speakers is sometimes brazen, sometimes subtle, it is a curious fact that it is difficult for other humans to detect, and at any rate humans exhibit a distinct bias towards believing what they are told."

"Although human performance at detecting deception is at chance, this research and prior research suggest that the unconscious linguistic signals included in a conscious act of deceiving are sufficient to allow us to build automatic systems capable of successfully distinguishing deceptive documents."

https://www.aclweb.org/anthology/L16-1558.pdf

Convolutional Neural Networks for Sentence Classification

"The model presented in the paper achieves good classification performance across a range of text classification tasks (like Sentiment Analysis) and has since become a standard baseline for new text classification architectures." https://arxiv.org/pdf/1408.5882v2.pdf

Implementing a CNN for Text Classification in TensorFlow

http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
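As a rough illustration of the two building blocks of the sentence-classification CNN referenced above (a filter slid over windows of word vectors, followed by max-over-time pooling), here is a toy pure-Python sketch. The 2-dimensional "embeddings", the filter values, and the function names are all made up for illustration; a real model would use pretrained vectors, many filters of several window sizes, and a framework such as TensorFlow.

```python
def conv_feature_map(sentence, filt, bias=0.0):
    """Slide one filter over every window of consecutive word vectors.

    sentence: list of word vectors (one list of floats per word)
    filt:     list of weight vectors, one per word position in the window
    Returns one ReLU-activated feature per window position.
    """
    window = len(filt)
    features = []
    for start in range(len(sentence) - window + 1):
        total = bias
        for word_vec, filt_vec in zip(sentence[start:start + window], filt):
            total += sum(w * x for w, x in zip(filt_vec, word_vec))
        features.append(max(0.0, total))
    return features


def max_over_time(features):
    """Keep only the strongest response of the filter across the sentence."""
    return max(features)


# Toy sentence of four words with made-up 2-dimensional embeddings.
sentence = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
]
# One filter covering a window of two words (illustrative weights).
filt = [
    [1.0, 0.0],
    [0.0, 1.0],
]

feature_map = conv_feature_map(sentence, filt)
pooled = max_over_time(feature_map)
print(feature_map)  # [2.0, 1.0, 1.0]
print(pooled)       # 2.0
```

Pooling reduces each filter to a single number regardless of sentence length, which is what lets stories of different lengths feed the same final classification layer.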

This repo was needed for classmates to see and grade my work.

