Reddit Language Variation

Code for the TACL paper, Characterizing English Variation across Social Media Communities with BERT. Citation forthcoming.

The name of this repository is "ingroup_lang" since it is about in-group language.

Package versions

Apache Spark 2.4.3, PyTorch 1.6.0, transformers 3.3.1, Python 3.7

See requirements.txt for more details. Some code from early in the project may still be in Python 2.7. I have tried to upgrade everything to Python 3.7 but may have missed some instances; just let me know if you find any.

Repository Map

Code

Data preprocessing, wrangling, and organizing

data_organize.py
dataset_statistics.py
get_sense_vocab.py
langid.py
language_id.py
language_id_helper.py
tokenizer.sh
tokenizer_helper.py

SemEval experiments

bert_vectors.py
bert_post.py
cluster_vectors.py

Clustering and matching of word embeddings on Reddit data

bert_cluster_train.py: clustering 1 word at a time
bert_cluster_match.py: matching 1 subreddit at a time
analyze_bert.py: visualization
spectral.py

Amrami & Goldberg 2019 fork

The repo here
Thank you to Asaf Amrami for making your code accessible

Community language metrics

sense_pmi.py
textrank.py
word_rarity.py

Glossary analysis

glossary_eval.py
senses.ipynb

Community behavior analysis

comment_networks.py
comment_networks_helper.py
loyalty.py
sociolect_score_analysis.ipynb
users.py
users_sociolect_analysis.py

Data

We used two months of data, May and June 2019, from Pushshift's collection of Reddit comments. If you would like the sampled comments (80k per subreddit) that Lucy used, email her, since they are too big for GitHub.
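If you want to draw a comparable per-subreddit sample from the raw dumps yourself, the following is a minimal sketch of one way to do it. It is an illustration under stated assumptions, not the procedure used in this repository: the function name `sample_per_subreddit` is hypothetical, the input is assumed to be newline-delimited JSON with "subreddit" and "body" fields, and reservoir sampling is a technique chosen here for the example.

```python
import json
import random
from collections import defaultdict


def sample_per_subreddit(lines, k, seed=0):
    """Keep a uniform random sample of up to k comments per subreddit.

    `lines` is any iterable of JSON strings, one comment per line, each with
    "subreddit" and "body" fields (assumed input format).
    """
    rng = random.Random(seed)
    samples = defaultdict(list)  # subreddit -> sampled comment bodies
    seen = defaultdict(int)      # subreddit -> number of comments seen so far
    for line in lines:
        comment = json.loads(line)
        sub = comment["subreddit"]
        seen[sub] += 1
        if len(samples[sub]) < k:
            # Fill the reservoir with the first k comments.
            samples[sub].append(comment["body"])
        else:
            # Reservoir sampling: the new comment replaces a random slot
            # with probability k / seen[sub].
            j = rng.randrange(seen[sub])
            if j < k:
                samples[sub][j] = comment["body"]
    return dict(samples)
```

Calling `sample_per_subreddit(open_file, k=80000)` on an opened, decompressed dump would stream through it once and hold at most k comments per subreddit in memory.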

Download SemEval 2013 Task 13 data: here. You should get a folder called "SemEval-2013-Task-13-test-data" that contains test data.

The ukWaC corpus for training SemEval 2013 can be found here; you may need to contact the owners to get a downloaded version.

Download SemEval 2010 Task 14 data: here. You should get a folder called "semeval-2010-task-14" that contains training and test data.

Subreddit glossaries, as csvs, are also in this folder.

Logs

This folder contains some of the outputs. There are several files also listing some of the community attributes of each subreddit in our dataset.

base_most_sense_pmi are pmi scores, largest to smallest, for BERT-base k-means
ag_most_sense_pmi are pmi scores, largest to smallest, for Amrami & Goldberg model
norm_pmi are type pmi scores, smallest to largest
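For readers unfamiliar with the metric, here is a minimal sketch of how a normalized PMI score between a word type and a community could be computed from raw counts. It is not the code in sense_pmi.py or word_rarity.py: the function name `npmi` and its count arguments are hypothetical, and the normalization shown (PMI divided by the negative log of the joint probability) is the standard NPMI definition, assumed here to be what "norm_pmi" refers to.

```python
import math


def npmi(count_word_in_comm, count_word, count_comm, total):
    """Normalized PMI between a word and a community, in [-1, 1].

    All arguments are hypothetical raw token counts:
      count_word_in_comm: occurrences of the word in the community
      count_word:         occurrences of the word in the whole corpus
      count_comm:         total tokens in the community
      total:              total tokens in the whole corpus
    """
    if count_word_in_comm == 0:
        return -1.0  # limiting value when the word never occurs in the community
    p_joint = count_word_in_comm / total
    p_word = count_word / total
    p_comm = count_comm / total
    if p_joint == 1.0:
        return 1.0  # degenerate case: -log(1) = 0 would divide by zero
    pmi = math.log(p_joint / (p_word * p_comm))
    return pmi / -math.log(p_joint)
```

A score near 1 means the word occurs almost exclusively in that community, a score near 0 means it is used there about as often as chance would predict, and a negative score means it is underused there.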
