EnviroNews

This project models topics in environmental news covered by various sources over several years. My goal is to explore topics using unsupervised learning techniques and to assess how well they detect subtopics. These techniques include (1) matrix decomposition/factorization, e.g., NMF (Non-negative Matrix Factorization), LDA (Latent Dirichlet Allocation), and PCA (Principal Component Analysis), and (2) clustering algorithms, e.g., KMeans.

One critical assumption I made in this project is that each article is described by exactly one topic, so one-hot encoding was used to categorize the articles. In reality, an article may touch upon many topics, and a model can certainly capture this nuance. However, this project focuses less on the subtlety of the topics than on the amount of coverage different news sources gave to them. This simplification made comparison between sources much easier.
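The single-topic assumption can be sketched as follows: take the highest-weighted topic for each article, one-hot encode it, and average per source to get coverage shares. The weights, source labels, and column names here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: articles, columns: topics),
# e.g. the W matrix from a factorization model.
W = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.1, 0.3],
])
sources = ["NYTimes", "NYTimes", "Fox News", "Fox News"]  # made-up labels

# Assign each article to its single highest-weighted topic.
dominant = W.argmax(axis=1)

# One-hot encode the assignment: exactly one 1 per row.
one_hot = np.eye(W.shape[1], dtype=int)[dominant]

# Share of each source's articles assigned to each topic.
coverage = (
    pd.DataFrame(one_hot, columns=["topic_0", "topic_1", "topic_2"])
    .assign(source=sources)
    .groupby("source")
    .mean()
)
print(coverage)
```

Because every row of `one_hot` sums to 1, each source's shares sum to 1 as well, so sources with very different article counts can be compared directly.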

Data Sources & Storage

I obtained full-text articles from NYTimes and Fox News to compare their coverage. NYTimes was selected for its extensively developed API, and Fox News because it is a good comparison point to NYTimes. Additionally, I used NewsAPI to get articles from a wide range of sources; the free plan returns articles up to 1 month old. Results of this project are displayed as an interactive Tableau dashboard in 3 tabs: (1) evolution of environmental topics in NYTimes over 16 years, (2) comparison between NYTimes and Fox News, and (3) topic distribution in articles obtained by NewsAPI.

The full-text articles are stored in MongoDB on an AWS-EC2 instance. MongoDB is a NoSQL database that uses JSON-like documents and syntax. Because the NewsAPI output, the Fox News website, and the NYTimes API output share no common data structure, and because full-text articles are long-form, MongoDB was selected to work with the unstructured data. The articles are kept on the AWS-EC2 instance because of their large file size. The CSVs in this GitHub repository consist only of the URLs for the articles:

NYTimes: 13654 articles (08/2002-07/2018)
Fox News: 3132 articles (09/2012-08/2018)
NYTimes/Fox News subset (same time frame): 6876 articles (09/2012-08/2018)
NewsAPI (various sources): 20628 articles (09/2017-08/2018)
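The storage choice described above can be sketched with pymongo: articles from different sources are wrapped in a loosely structured document and inserted as-is. The field names, connection URI, and database and collection names are placeholders, not the ones used in this project.

```python
from datetime import datetime, timezone


def to_document(source, url, text, published=None):
    """Build one article document; the field names are illustrative."""
    return {
        "source": source,
        "url": url,
        "text": text,
        "published": published,  # may be missing for some sources
        "fetched_at": datetime.now(timezone.utc),
    }


def store_articles(collection, documents):
    """Insert documents into a pymongo collection; return how many were inserted."""
    if not documents:
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


def get_collection(uri="mongodb://localhost:27017"):
    """Connect to MongoDB; URI, database, and collection names are placeholders."""
    # Imported lazily so the helpers above can be used without a running server.
    from pymongo import MongoClient
    return MongoClient(uri)["environews"]["articles"]
```

With a server running, `store_articles(get_collection(), [to_document("NYTimes", url, text)])` would insert one article. Since MongoDB does not enforce a schema, sources that lack a field such as `published` need no special handling.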

Notes

This project consists of four parts:

Getting URLs for environment-related articles (code available here)
Downloading full-text articles from URLs using newspaper API and storing them in MongoDB on an AWS-EC2 instance
Using NLP techniques to model topics
Integrating modeling data into Tableau for visualization and comparison
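The download step above can be sketched with the `newspaper` package's `Article` class. This is an assumption about how the step might look, not the project's actual code; the `article_cls` parameter and the `fetch_many` helper are hypothetical additions that make the function easy to try without network access.

```python
def fetch_full_text(url, article_cls=None):
    """Download one article and return its title and body text.

    article_cls defaults to newspaper.Article; passing another class with the
    same download()/parse() interface is a hypothetical hook for testing.
    """
    if article_cls is None:
        from newspaper import Article  # provided by the newspaper3k package
        article_cls = Article
    article = article_cls(url)
    article.download()   # fetch the raw HTML
    article.parse()      # extract title and body text
    return {"url": url, "title": article.title, "text": article.text}


def fetch_many(urls, article_cls=None):
    """Fetch several URLs, skipping any that fail to download or parse."""
    results = []
    for url in urls:
        try:
            results.append(fetch_full_text(url, article_cls))
        except Exception as exc:  # broad on purpose: one bad URL should not stop the batch
            print(f"skipped {url}: {exc}")
    return results
```

The returned dictionaries can be passed straight to a MongoDB insert, and skipping failed URLs keeps a long batch running when individual pages have moved or been removed.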

Python packages required: pandas, numpy, seaborn, matplotlib, sklearn, pymongo
