
KChalk/RedditProject


Welcome to my Reddit Repo

Here you'll find PySpark code to:

  • read and process reddit submission data
  • calculate frequencies of words from various word collections in selected posts and subreddits

and Python code which uses t-SNE to plot the relationships between subreddits as described by their word collection frequencies, as well as a PowerPoint explaining t-SNE and its application to a few other domains.
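To make the word collection frequencies concrete, here is a minimal sketch in plain Python (not the PySpark code in this repo). It computes, for each subreddit, the share of tokens that belong to a word collection. The function name, the sample word set, and the subreddit names are all hypothetical, and the tokenizer is a deliberately simple assumption; the real absolutist and LIWC collections are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical stand-in for a word collection (not the real absolutist or LIWC list).
ABSOLUTIST_SAMPLE = {"always", "never", "completely", "totally"}

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def collection_frequencies(posts, collection):
    """Return {subreddit: share of tokens that belong to `collection`}.

    `posts` is an iterable of (subreddit, text) pairs.
    """
    hits = Counter()    # tokens per subreddit that are in the collection
    totals = Counter()  # all tokens per subreddit
    for subreddit, text in posts:
        tokens = TOKEN_RE.findall(text.lower())
        totals[subreddit] += len(tokens)
        hits[subreddit] += sum(1 for token in tokens if token in collection)
    # Skip subreddits with no tokens to avoid dividing by zero.
    return {s: hits[s] / totals[s] for s in totals if totals[s] > 0}


# Made-up example posts, for illustration only.
posts = [
    ("r/example_a", "I always forget and never learn"),
    ("r/example_b", "Sometimes it works"),
]
freqs = collection_frequencies(posts, ABSOLUTIST_SAMPLE)
```

Running this for several collections gives each subreddit a small vector of frequencies, which is the kind of input a t-SNE plot of subreddits would take.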

This represents my (Kendra Chalkley's) coursework and pet project from portions of my MS CS at Oregon Health and Science University's Center for Spoken Language Understanding. (It's important to mention this because I'm currently looking for my first job since completing this degree, and hopefully someone has made it this far as a result of my resume...)

Important citations for this work include:

  • files.pushshift.io, which hosts compressed collections of reddit data, pre-harvested from the reddit API
  • "In an Absolute State" by Al-Mosaiwi and Johnstone, which was the initial inspiration for the project. Their absolutist dictionary is one of the word collections used throughout the project. The others are from LIWC collections.
  • t-SNE visualization, which was one of the most successful aspects of the project, for a variety of timing reasons. I gave a presentation explaining the algorithm, which is available in the tsne subfolder of this repo. The slides lack narrative; for that, see the t-SNE author's Google Tech Talk.

Presentation notes and notebook improvements are the next anticipated updates to this folder, and will be delayed only by a competing need to write cover letters.

About

For consolidating, organizing, and improving code from various other repos under other classes and projects.
