🎊 PoliBERTweet: Language Models for Political Tweets

Transformer-based language models pre-trained on a large amount of politics-related Twitter data (83M tweets). This repo is the official resource of the paper "PoliBERTweet: A Pre-trained Language Model for Analyzing Political Content on Twitter", LREC 2022.

📚 Data Sets

The data sets for the evaluation tasks presented in our paper are available below.

🚀 Pre-trained Models

All models are uploaded to my Hugging Face 🤗 so you can load a model with just three lines of code!

⚙️ Usage

We tested our code with PyTorch v1.10.2 and Transformers v4.18.0.

  • To fine-tune our models for a specific task (e.g., stance detection), see the HuggingFace Doc; a minimal sketch is given in step 4 below.
  • Please see the specific model pages above for more usage details. Below is a sample use case.

1. Load the model and tokenizer

from transformers import AutoModel, AutoTokenizer, pipeline
import torch

# Choose GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Select model path here
pretrained_LM_path = "kornosk/polibertweet-mlm"

# Load the tokenizer and model, then move the model to the chosen device
tokenizer = AutoTokenizer.from_pretrained(pretrained_LM_path)
model = AutoModel.from_pretrained(pretrained_LM_path)
model.to(device)

2. Predict the masked word

# The fill-mask pipeline loads its own masked-LM head from the model path
example = "Trump is the <mask> of USA"
fill_mask = pipeline('fill-mask', model=pretrained_LM_path, tokenizer=tokenizer)

outputs = fill_mask(example)
print(outputs)
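
The pipeline returns a ranked list of candidate fills, one dict per prediction with keys such as token_str and score. A small post-processing loop (our illustration, not part of the repo) makes the output easier to read:

# Print each candidate token with its probability
for pred in outputs:
    print(f"{pred['token_str']}\t{pred['score']:.4f}")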

3. See embeddings

# Encode the example and move the tensors to the same device as the model
inputs = tokenizer(example, return_tensors="pt").to(device)
outputs = model(**inputs)
print(outputs)

# Or use this model as a starting point for your downstream task!
# Please consider citing our paper if you find it useful :)
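
If you need one fixed-size vector per tweet (e.g., for clustering or a lightweight classifier), a common recipe is mean pooling over the token embeddings. This pooling choice is our suggestion, not something prescribed by the paper:

# outputs.last_hidden_state has shape (batch_size, seq_len, hidden_size)
with torch.no_grad():
    outputs = model(**inputs)

# Average over the token dimension to get one vector per input
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # e.g., torch.Size([1, 768])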

4. Fine-tune on a downstream task like stance detection

See details in the HuggingFace Doc; a minimal example is sketched below.
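
For reference, here is a minimal fine-tuning sketch built on the Transformers Trainer. The toy texts, the 3-way stance label scheme, and all hyperparameters are illustrative assumptions; substitute your own stance-detection data:

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
import torch

# Start from the MLM checkpoint and add a fresh classification head
model_path = "kornosk/polibertweet-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=3)

# Toy in-memory data (assumed labels: 0=FAVOR, 1=AGAINST, 2=NONE)
texts = ["I will vote for him!", "Never voting for him."]
labels = [0, 1]
encodings = tokenizer(texts, truncation=True, padding=True)

class StanceDataset(torch.utils.data.Dataset):
    """Wraps tokenized tweets and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="stance_out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=StanceDataset(encodings, labels),
)
trainer.train()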

✏️ Citation

If you find our paper and resources useful, please consider citing our work! 🙏

@inproceedings{kawintiranon2022polibertweet,
  title     = {{P}oli{BERT}weet: A Pre-trained Language Model for Analyzing Political Content on {T}witter},
  author    = {Kawintiranon, Kornraphop and Singh, Lisa},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
  year      = {2022},
  pages     = {7360--7367},
  publisher = {European Language Resources Association},
  url       = {https://aclanthology.org/2022.lrec-1.801}
}

🛠 Troubleshooting

Create an issue here if you have any problems loading the models or data sets.
