Code for the TACL paper, Characterizing English Variation across Social Media Communities with BERT. Citation forthcoming.
The name of this repository is "ingroup_lang" since it is about in-group language.
Apache Spark 2.4.3, PyTorch 1.6.0, transformers 3.3.1, Python 3.7
See requirements.txt for more details. Some code from early in the project may still be in Python 2.7. I have tried to upgrade everything to Python 3.7 but may have missed some instances; just let me know if you find any.
Data preprocessing, wrangling, and organizing
- data_organize.py
- dataset_statistics.py
- get_sense_vocab.py
- langid.py
- language_id.py
- language_id_helper.py
- tokenizer.sh
- tokenizer_helper.py
SemEval experiments
- bert_vectors.py
- bert_post.py
- cluster_vectors.py
Clustering and matching of word embeddings on Reddit data
- bert_cluster_train.py: clustering 1 word at a time
- bert_cluster_match.py: matching 1 subreddit at a time
- analyze_bert.py: visualization
- spectral.py
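As a rough illustration of the two-step idea named above (cluster the embeddings of one word, then match a subreddit's tokens to those clusters), here is a minimal sketch of the matching step only. It assumes the centroids already exist and that matching means assigning each token vector to its nearest centroid by Euclidean distance; the function names `nearest_centroid` and `match_tokens` are hypothetical and are not taken from bert_cluster_match.py, which may work differently.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector`.

    Uses squared Euclidean distance, which gives the same ranking
    as Euclidean distance. Assumption: this is only an illustration,
    not the repository's actual matching code.
    """
    best_index = -1
    best_dist = float("inf")
    for i, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def match_tokens(token_vectors, centroids):
    """Assign every token vector to its nearest centroid (sense cluster)."""
    return [nearest_centroid(v, centroids) for v in token_vectors]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real BERT vectors are much larger.
    centroids = [[0.0, 0.0], [10.0, 10.0]]
    tokens = [[1.0, 1.0], [9.0, 8.0]]
    print(match_tokens(tokens, centroids))
```

Each returned index can be read as the sense cluster that a token occurrence falls into.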
Amrami & Goldberg 2019 fork
- The repo here
- Thank you to Asaf Amrami for making your code accessible
Community language metrics
- sense_pmi.py
- textrank.py
- word_rarity.py
Glossary analysis
- glossary_eval.py
- senses.ipynb
Community behavior analysis
- comment_networks.py
- comment_networks_helper.py
- loyalty.py
- sociolect_score_analysis.ipynb
- users.py
- users_sociolect_analysis.py
We used two months of data, May and June 2019, from Pushshift's collection of Reddit comments. If you would like the sampled comments (80k per subreddit) that Lucy used, email her, since they are too big for GitHub.
Download SemEval 2013 Task 13 data: here. You should get a folder called "SemEval-2013-Task-13-test-data" that contains test data.
The ukWaC corpus for training SemEval 2013 can be found here. You may need to contact the owners to get a downloaded version.
Download SemEval 2010 Task 14 data: here. You should get a folder called "semeval-2010-task-14" that contains training and test data.
Subreddit glossaries, as CSV files, are also in this folder.
This folder contains some of the outputs. Several files also list some of the community attributes of each subreddit in our dataset.
- base_most_sense_pmi are PMI scores, sorted largest to smallest, for the BERT-base k-means model
- ag_most_sense_pmi are PMI scores, sorted largest to smallest, for the Amrami & Goldberg model
- norm_pmi are type PMI scores, sorted smallest to largest
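
For readers unfamiliar with the scores in these files, the sketch below shows one common way to compute a normalized PMI between a word type and a community. It assumes the standard definition, PMI(w, c) = log(P(w | c) / P(w)) divided by -log P(w, c); the exact formulation in word_rarity.py and sense_pmi.py may differ, and the function name `type_npmi` and its inputs are hypothetical.

```python
import math
from collections import Counter


def type_npmi(community_tokens, all_tokens):
    """Return {word: normalized PMI} for each word in one community.

    community_tokens: tokens from a single community (hypothetical input).
    all_tokens: tokens from every community, including this one.
    Assumption: standard NPMI, not necessarily the repository's formula.
    """
    community_counts = Counter(community_tokens)
    all_counts = Counter(all_tokens)
    n_community = len(community_tokens)
    n_all = len(all_tokens)

    scores = {}
    for word, count_in_community in community_counts.items():
        p_word_given_community = count_in_community / n_community
        p_word = all_counts[word] / n_all
        p_joint = count_in_community / n_all  # P(word, community)
        pmi = math.log(p_word_given_community / p_word)
        denominator = -math.log(p_joint)
        # Guard against log(1) = 0 when one word makes up the whole corpus.
        scores[word] = pmi / denominator if denominator > 0 else 0.0
    return scores


if __name__ == "__main__":
    community = ["a", "a", "b"]
    everything = ["a", "a", "b", "b", "b", "c"]
    scores = type_npmi(community, everything)
    # Sort largest to smallest, like the *_most_sense_pmi files.
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(word, round(score, 3))
```

Higher scores mark words that are more strongly associated with the community than with the dataset as a whole.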