Predicting Political Lean of a News Article

Stanley Stevens // December 2016 // GA Class Project

OVERVIEW

Idea

Use news articles previously labeled with political bias to train a supervised logistic regression model that classifies new/other articles. The ultimate goal is to increase self-awareness of my personal political views and grow into a more balanced and nuanced perspective (with the underlying assumption that “you are what you read”).

Resources

Github/Jupyter Notebooks - see above
Training Data - source
Untrained Data - personal political articles I’ve read (count: 310 articles) - source
More details (including exploration notes) - source
Presentation - pdf / slides

Data Dictionary

political lean: political bias of article (left, lean left, center, lean right, right, mixed, not rated)
cleaned_text: article text (no html)
ugly_text: article text (including html)
url_raw: full url of article
url_clean: full url minus key/value pairs at end of url string
url_domain: host/domain of article (e.g. cnn.com)
title: title of article
meta_description: mini summary of article
issue: topic of article (e.g. economy, election, environment, healthcare, etc.)
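
As a rough illustration of how url_clean and url_domain relate to url_raw, here is a minimal sketch using only the Python standard library. The function names and the example URL are hypothetical, and stripping a leading "www." is an assumption based on the "cnn.com" example above; the notebooks may derive these fields differently.

```python
from urllib.parse import urlsplit, urlunsplit


def clean_url(url_raw: str) -> str:
    """Drop the query string and fragment (the key/value pairs at the end)."""
    parts = urlsplit(url_raw)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def url_domain(url_raw: str) -> str:
    """Return the host, without a leading 'www.' (assumed convention)."""
    host = urlsplit(url_raw).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Made-up URL, for illustration only.
example = "http://www.cnn.com/2016/12/01/politics/some-story/index.html?utm_source=feed&id=42"
print(clean_url(example))   # url_clean
print(url_domain(example))  # url_domain
```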

Model Selection

KNN: by far the worst of the three models, with average accuracy scores in the 0.2s and 0.3s
MultinomialNB: I explored Multinomial Naive Bayes up front using a number of different features and parameters, but it always seemed to underperform logistic regression by roughly 10-30% (though it was much faster, as expected)
Logistic Regression
- Uses the count vectorizer params ngram and min_df
- I ended up exploring two logistic regression models with different feature sets: Model A (Text+Domain+Url) and Model B (Domain only)
- These two models gave different results. Model B (domain only) seems to suggest that I read very few ‘Right’ articles, while model A (domain/url + text) suggests a more balanced reading. Anecdotally (simply from knowing what I read), I’d say it’s actually somewhere in the middle: mostly ‘lean left’, with maybe 10-15% right or right-leaning articles. That said, the point of this exercise is to gain a higher level of self-awareness, so my own judgment may very well be biased.
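
The comparison described above could be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions, not the code from the notebooks: the toy rows, the `political_lean` column name, `n_neighbors=3`, `min_df=1`, and `cv=3` are all made up for illustration, and the url feature of Model A is left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the training data; every value here is invented.
left = ["expand healthcare coverage", "raise the minimum wage", "protect the environment",
        "invest in public schools", "strengthen labor unions", "expand healthcare access"]
right = ["cut taxes for business", "secure the border", "reduce government spending",
         "protect gun rights", "cut taxes and regulation", "reduce federal spending"]
df = pd.DataFrame({
    "cleaned_text": left + right,
    "url_domain": ["a.example", "b.example", "a.example"] * 2
                  + ["c.example", "d.example", "c.example"] * 2,
    "political_lean": ["left"] * 6 + ["right"] * 6,
})
X, y = df[["cleaned_text", "url_domain"]], df["political_lean"]


def features(use_text: bool) -> ColumnTransformer:
    """Domain one-hot features, optionally combined with word n-gram counts."""
    steps = [("domain", OneHotEncoder(handle_unknown="ignore"), ["url_domain"])]
    if use_text:
        # ngram_range and min_df are the two vectorizer params the README mentions.
        steps.append(("text", CountVectorizer(ngram_range=(1, 2), min_df=1), "cleaned_text"))
    return ColumnTransformer(steps)


candidates = {
    "knn": make_pipeline(features(True), KNeighborsClassifier(n_neighbors=3)),
    "multinomial_nb": make_pipeline(features(True), MultinomialNB()),
    "logreg_model_a": make_pipeline(features(True), LogisticRegression(max_iter=1000)),
    "logreg_model_b": make_pipeline(features(False), LogisticRegression(max_iter=1000)),
}

# Mean cross-validated accuracy per candidate.
scores = {name: cross_val_score(pipe, X, y, cv=3).mean() for name, pipe in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Scores on a toy set like this say nothing about the real data; the point is only the shape of the comparison, with Model B differing from Model A solely in its feature set.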

Conclusion

Challenges

Collection process
- The first attempt at collecting html content failed because it included political bias tags, leading to an overfit model.
- The overall collection process took approximately 10-15 human hours and somewhere between 50 and 100 processing hours.
Despite a high accuracy (0.97), model B (logreg) is potentially overfitting on url/domain
Over-classification of ‘Right’ (possibly due to more ‘Right’ articles; needs more exploration)
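
One simple way to start exploring the over-classification of ‘Right’ is to compare each label's share of the training data with its share of the predictions. The sketch below uses only the standard library; the `label_shares` helper and both label lists are hypothetical, not taken from the real data.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Invented labels, for illustration only.
train_labels = ["right"] * 5 + ["left"] * 3 + ["center"] * 2
predicted = ["right"] * 7 + ["left"] * 2 + ["center"] * 1

train_share = label_shares(train_labels)
pred_share = label_shares(predicted)
for label in sorted(train_share):
    print(f"{label}: train={train_share[label]:.2f} predicted={pred_share.get(label, 0.0):.2f}")
```

If a label's predicted share is well above its training share, class imbalance alone probably does not explain it, and the features themselves would be worth a closer look.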

Successes

Model A (logreg) had a mixed accuracy of 0.91, though when I applied it to my own (untrained) data, I was less confident in some of the classifications

Applied Solutions (future work)

As mentioned above, I applied the model to 310 articles that I had previously classified as ‘politics’, and the predictions seem to be mostly correct (anecdotally about 65-75%). With some further improvement, I will apply the model to my reading habits/tracking website.
310 Articles - data source (csv)
I would also like to connect it to Facebook and Twitter (pulling in the articles people have posted) to let them see where they stand from a political bias perspective
A next step for both of the above will be to suggest articles from a different perspective so as to get a more balanced and nuanced view of the world.
It would also probably be useful to create a model that detects whether an article is political, so I can run it against any news article and determine whether this model is even relevant.
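
The "suggest articles from a different perspective" step could start from something as simple as the sketch below: find the most common predicted lean in a reading history and look up its counterpart. The `OPPOSITE` mapping, the `suggest_lean` helper, and the sample history are all hypothetical; nothing like this exists in the repository yet.

```python
from collections import Counter

# Hypothetical mapping from a lean to the perspective that would balance it.
OPPOSITE = {
    "left": "right",
    "lean left": "lean right",
    "center": "center",
    "lean right": "lean left",
    "right": "left",
}


def suggest_lean(predicted_leans):
    """Return the lean to recommend, given predicted leans of articles already read."""
    most_common, _ = Counter(predicted_leans).most_common(1)[0]
    # Fall back to 'center' for labels without a clear counterpart (e.g. 'mixed').
    return OPPOSITE.get(most_common, "center")


# Invented reading history, for illustration only.
history = ["lean left"] * 6 + ["center"] * 2 + ["right"] * 2
print(suggest_lean(history))
```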


Stanleyyork/political_lean_prediction
