lucy3/ingroup_lang

Reddit Language Variation

Code for the TACL paper, Characterizing English Variation across Social Media Communities with BERT. Citation forthcoming.

The name of this repository is "ingroup_lang" because it is about in-group language.

Package versions

Apache Spark 2.4.3, PyTorch 1.6.0, transformers 3.3.1, Python 3.7

See requirements.txt for more details. Some code from early in the project may still be in Python 2.7. I have tried to upgrade everything to Python 3.7 but may have missed some instances; let me know if you find any.

Repository Map

Code

Data preprocessing, wrangling, and organizing

  • data_organize.py
  • dataset_statistics.py
  • get_sense_vocab.py
  • langid.py
  • language_id.py
  • language_id_helper.py
  • tokenizer.sh
  • tokenizer_helper.py

SemEval experiments

  • bert_vectors.py
  • bert_post.py
  • cluster_vectors.py

Clustering and matching of word embeddings on Reddit data

  • bert_cluster_train.py: clustering 1 word at a time
  • bert_cluster_match.py: matching 1 subreddit at a time
  • analyze_bert.py: visualization
  • spectral.py

Amrami & Goldberg 2019 fork

  • The repo here
  • Thank you to Asaf Amrami for making your code accessible

Community language metrics

  • sense_pmi.py
  • textrank.py
  • word_rarity.py

Glossary analysis

  • glossary_eval.py
  • senses.ipynb

Community behavior analysis

  • comment_networks.py
  • comment_networks_helper.py
  • loyalty.py
  • sociolect_score_analysis.ipynb
  • users.py
  • users_sociolect_analysis.py

Data

We used two months of data, May and June 2019, from Pushshift's collection of Reddit comments. If you would like the sampled comments (80k per subreddit) that Lucy used, email her, since they are too large to host on GitHub.
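The README does not say how the 80k comments per subreddit were drawn. As a minimal sketch, assuming uniform sampling and assuming the input is newline-delimited JSON with `subreddit` and `body` fields, a per-subreddit reservoir sample could look like the following. The function name `sample_comments` and the `seed` parameter are hypothetical and are not taken from this repository's scripts.

```python
import json
import random
from collections import defaultdict


def sample_comments(lines, k, seed=0):
    """Keep a uniform random sample of up to k comment bodies per subreddit.

    `lines` is any iterable of JSON strings, one comment per line.
    Assumption: each comment has "subreddit" and "body" fields.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # subreddit -> sampled comment bodies
    seen = defaultdict(int)         # subreddit -> comments seen so far

    for line in lines:
        comment = json.loads(line)
        sub = comment["subreddit"]
        seen[sub] += 1
        if len(reservoirs[sub]) < k:
            # Fill the reservoir until it holds k comments.
            reservoirs[sub].append(comment["body"])
        else:
            # Replace a random slot with probability k / seen[sub].
            j = rng.randrange(seen[sub])
            if j < k:
                reservoirs[sub][j] = comment["body"]
    return dict(reservoirs)
```

Reservoir sampling makes one pass over the data and holds at most k comments per subreddit in memory, which may matter for monthly dumps that are too large to load at once. With `k=80000` this would match the per-subreddit sample size mentioned above.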

Download SemEval 2013 Task 13 data: here. You should get a folder called "SemEval-2013-Task-13-test-data" that contains test data.

The ukWaC corpus for training SemEval 2013 can be found here; you may need to contact the owners to get a downloadable copy.

Download SemEval 2010 Task 14 data: here. You should get a folder called "semeval-2010-task-14" that contains training and test data.

Subreddit glossaries, as CSV files, are also in this folder.

Logs

This folder contains some of the outputs, along with several files listing community attributes of each subreddit in our dataset.

  • base_most_sense_pmi contains PMI scores, largest to smallest, for BERT-base k-means
  • ag_most_sense_pmi contains PMI scores, largest to smallest, for the Amrami & Goldberg model
  • norm_pmi contains type PMI scores, smallest to largest
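For readers unfamiliar with the scores in these files, here is a minimal sketch of pointwise mutual information between a word type and a community, assuming the standard definition PMI(w, c) = log [P(w, c) / (P(w) P(c))] and, optionally, the standard normalization by -log P(w, c). It is an illustration only: the function name `pmi_scores` and its input format are hypothetical, and the exact computation in sense_pmi.py may differ.

```python
import math
from collections import Counter


def pmi_scores(community_counts, normalize=False):
    """Compute (normalized) PMI for every (community, word) pair.

    `community_counts` maps a community name to a Counter of word counts.
    Assumption: probabilities are estimated from raw counts, no smoothing.
    """
    total = sum(sum(c.values()) for c in community_counts.values())
    word_totals = Counter()
    for counts in community_counts.values():
        word_totals.update(counts)

    scores = {}
    for community, counts in community_counts.items():
        n_c = sum(counts.values())
        for word, n_wc in counts.items():
            # PMI = log [ P(w, c) / (P(w) * P(c)) ], written with counts.
            pmi = math.log((n_wc * total) / (word_totals[word] * n_c))
            if normalize:
                p_joint = n_wc / total
                # NPMI divides by -log P(w, c), giving values in [-1, 1].
                pmi = pmi / -math.log(p_joint) if p_joint < 1 else 0.0
            scores[(community, word)] = pmi
    return scores


toy = {"a": Counter({"x": 3, "y": 1}), "b": Counter({"y": 4})}
ranked = sorted(pmi_scores(toy).items(), key=lambda kv: kv[1], reverse=True)
```

In the toy data, "x" appears only in community "a", so the pair ("a", "x") gets the highest score, while "y" is more typical of "b" than of "a". Sorting in descending order mirrors the "largest to smallest" ordering described for the sense PMI files.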

About

Code for 2021 TACL paper on community-specific language
