Using sentiment induction to understand variation in gendered online communities

Abstract

We analyze gendered communities defined in three different ways: text, users, and sentiment. Differences across these representations reveal facets of communities' distinctive identities, such as social group, topic, and attitudes. Two communities may have high text similarity but not user similarity, or vice versa, and word usage does not vary according to a clear-cut, binary perspective of gender. Community-specific sentiment lexicons demonstrate that sentiment can be a useful indicator of words' social meaning and community values, especially in the context of discussion content and user demographics. Our results show that social platforms such as Reddit are active settings for different constructions of gender.

Our paper can be found here: link TBD.

Setup

This directory is built on top of SocialSent. Several of the code files depend on it, so before running them, download the socialsent folder and place it inside the code folder.

Data

We used Reddit comments posted between May 2016 and April 2017 in nine gendered communities that are among the 400 most popular subreddits: r/actuallesbians, r/askgaybros, r/mensrights, r/askmen, r/askwomen, r/xxfitness, r/femalefashionadvice, r/malefashionadvice, and r/trollxchromosomes. We used a dataset provided by the Stanford Infolab, but Reddit comment data is also publicly available in various forms: on BigQuery here or via download with an API here.
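For readers working from a public dump rather than the Infolab dataset, the selection described above could be sketched roughly as follows. This is not the code in this repository. It assumes one JSON object per line with `subreddit` and `created_utc` (Unix seconds) fields, as in commonly distributed Reddit comment dumps, and it assumes the window runs from 1 May 2016 up to, but not including, 1 May 2017 in UTC. The names `keep_comment` and `filter_comments` are hypothetical.

```python
import json
from datetime import datetime, timezone

# Subreddit names from the list above, lowercased for matching.
TARGET_SUBREDDITS = {
    "actuallesbians", "askgaybros", "mensrights", "askmen", "askwomen",
    "xxfitness", "femalefashionadvice", "malefashionadvice",
    "trollxchromosomes",
}

# Assumed window: 1 May 2016 (inclusive) to 1 May 2017 (exclusive), UTC.
START = datetime(2016, 5, 1, tzinfo=timezone.utc).timestamp()
END = datetime(2017, 5, 1, tzinfo=timezone.utc).timestamp()


def keep_comment(comment):
    """Return True if the comment is from a target subreddit inside the window."""
    subreddit = comment.get("subreddit", "").lower()
    # created_utc is sometimes stored as a string, so coerce it to a number.
    created = float(comment.get("created_utc", 0))
    return subreddit in TARGET_SUBREDDITS and START <= created < END


def filter_comments(lines):
    """Yield parsed comments from an iterable of JSON lines that pass keep_comment."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        comment = json.loads(line)
        if keep_comment(comment):
            yield comment
```

Because `filter_comments` accepts any iterable of strings, it can be fed an open file handle directly, which avoids loading a full monthly dump into memory.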

Code

  • clustering.py contains code for clustering user and text representations of subreddits
  • create_docs.py concatenates Reddit comments into large documents, one per subreddit
  • create_subreddit_list.py shows how we narrowed down to our target subreddits
  • misalignment.py examines differences between text and user representations
  • pipeline.py creates sentiment lexicons with SentProp
  • plot_sim_correlations.ipynb contains analysis and plots of sentiment
  • subreddit_counts.py calculates basic statistics about our data
  • variance_sentiment.py finds words with high variance in sentiment across subreddits
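As a rough illustration of the idea behind the last item, the sketch below ranks words by how much their sentiment score varies across communities. It is not taken from variance_sentiment.py. It assumes each lexicon is already loaded as a dictionary mapping words to numeric scores, uses population variance, and skips words that appear in fewer than `min_subreddits` lexicons. The function name, the parameter, and the toy scores in the usage note are all hypothetical.

```python
from statistics import pvariance


def sentiment_variance(lexicons, min_subreddits=2):
    """Rank words by the variance of their sentiment score across subreddits.

    lexicons: dict mapping subreddit name -> {word: sentiment score}.
    Returns a list of (word, variance) pairs, highest variance first.
    """
    # Collect every score observed for each word across all lexicons.
    scores_by_word = {}
    for lexicon in lexicons.values():
        for word, score in lexicon.items():
            scores_by_word.setdefault(word, []).append(score)

    # Keep only words covered by enough subreddits to make variance meaningful.
    variances = {
        word: pvariance(scores)
        for word, scores in scores_by_word.items()
        if len(scores) >= min_subreddits
    }
    return sorted(variances.items(), key=lambda item: item[1], reverse=True)
```

With made-up scores, `sentiment_variance({"a": {"soft": 1.0}, "b": {"soft": -1.0}})` would place `soft` at the top, since its score flips sign between the two toy communities.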

Lexicons

The induced sentiment lexicons we analyzed in our paper can be found here. We also include our PPMI-SVD word vectors for each subreddit in ppmi_svd_vectors.zip and word frequencies in vocab_counts.zip.
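For readers unfamiliar with the PPMI part of PPMI-SVD, the following sketch computes positive pointwise mutual information from word-context co-occurrence counts. It is a generic textbook formulation, not the SocialSent implementation, which may apply smoothing or other adjustments. The input format (a dictionary keyed by `(word, context)` pairs) and the function name `ppmi` are assumptions made for illustration. The SVD step, which reduces the resulting sparse matrix to dense vectors, is omitted here.

```python
import math
from collections import defaultdict


def ppmi(cooccurrence):
    """Compute positive PMI for each (word, context) pair.

    cooccurrence: dict mapping (word, context) -> co-occurrence count.
    Returns a dict containing only the pairs whose PMI is strictly positive.
    """
    total = sum(cooccurrence.values())

    # Marginal counts for words and for contexts.
    word_totals = defaultdict(float)
    context_totals = defaultdict(float)
    for (word, context), count in cooccurrence.items():
        word_totals[word] += count
        context_totals[context] += count

    scores = {}
    for (word, context), count in cooccurrence.items():
        if count <= 0:
            continue  # PMI is undefined for unseen pairs; treat as zero
        # PMI = log( P(w, c) / (P(w) * P(c)) ), written in terms of raw counts.
        pmi = math.log(count * total / (word_totals[word] * context_totals[context]))
        if pmi > 0:
            scores[(word, context)] = pmi
    return scores
```

Pairs with negative or zero PMI are left out of the result, which is equivalent to clipping them to zero in a sparse matrix.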