Hummingbird

Hummingbird dataset and code for EMNLP 2021 paper "Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica"

Hummingbird Dataset

The dataset is under the data/hummingbird folder. It contains annotated texts for all eight styles (politeness, sentiment, offensiveness, anger, disgust, fear, joy, and sadness).

Below is an explanation of each column:

  • human_label = annotator's style label for the text
    • 0 if the text is polite, positive, or expresses anger, disgust, fear, joy, or sadness,
    • 1 if the text is impolite, negative, does not express anger, etc.,
    • 0.5 if neutral (for "politeness" and "sentiment" only)
  • orig_text = original text
  • processed_text = text after preprocessing (lowercasing and removal of some punctuation)
  • perception_scores = human perception scores for the tokens in processed_text (see the loading sketch below)
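
A minimal loading sketch, assuming one tab-separated file per style under data/hummingbird (the file name politeness.tsv below is a placeholder) and that perception_scores is stored as a whitespace-separated list of per-token floats; adjust the parsing to match your local copy:

    # Minimal sketch, not the repository's own loader.
    # Assumptions: tab-separated file, one row per text, perception_scores
    # stored as whitespace-separated per-token floats.
    import pandas as pd

    df = pd.read_csv("data/hummingbird/politeness.tsv", sep="\t")

    row = df.iloc[0]
    tokens = row["processed_text"].split()
    scores = [float(s) for s in str(row["perception_scores"]).split()]

    # Pair each token with its human perception score.
    for token, score in zip(tokens, scores):
        print(f"{token}\t{score:.2f}")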

An example aggregation of the whole dataset can be seen in hummingbird_data.html. Blocked text means three annotators agree, blue/red text means two annotators agree, and gray text means only one person annotated the word as an important stylistic cue.

token_avg

This directory contains a list of words for each style with their corresponding count (count), average perception score (avg_attr), and its standard deviation (std_attr). Ignore the last column (avg_pred).
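
For example, a quick way to surface the strongest human-perceived cues for one style, assuming a tab-separated file per style in token_avg with a word column (the file name and "word" column name are guesses; check your copy):

    # Sketch: rank words by average human perception score for one style.
    import pandas as pd

    tok = pd.read_csv("token_avg/politeness.tsv", sep="\t")
    top = tok.sort_values("avg_attr", ascending=False).head(20)
    print(top[["word", "count", "avg_attr", "std_attr"]])  # avg_pred is ignored here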

A Subset of Existing Datasets

A subset of the benchmark datasets is under the data/orig folder. It includes word importance scores from Captum.

  • pred_class = predicted label,
    • 0 if the text is polite, positive, or expresses anger, disgust, fear, joy, or sadness,
    • 1 if the text is impolite, negative, does not express anger, etc.,
  • pred_prob = prediction probability
  • raw_input = text tokenized by BERT
  • attribution_scores = word importance scores computed with integrated gradients from Captum (see the sketch after this list)
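
A sketch of pairing BERT tokens with their attribution scores, assuming a tab-separated file under data/orig (the file name is hypothetical) where raw_input and attribution_scores are whitespace-separated strings; adjust the parsing if your copy stores them differently:

    # Sketch: align BERT tokens with integrated-gradients attributions.
    import pandas as pd

    df = pd.read_csv("data/orig/politeness.tsv", sep="\t")
    row = df.iloc[0]

    tokens = str(row["raw_input"]).split()
    attrs = [float(s) for s in str(row["attribution_scores"]).split()]

    print("pred_class:", row["pred_class"], " pred_prob:", row["pred_prob"])
    for tok, a in zip(tokens, attrs):
        print(f"{tok}\t{a:+.4f}")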

These existing datasets are extracted from the following previous works:

| Style         | Name                                                             | Link |
|---------------|------------------------------------------------------------------|------|
| Politeness    | StanfordPoliteness                                               | link |
| Sentiment     | SentiTreeBank                                                    | link |
| Offensiveness | Tweet Datasets for Hate Speech and Offensiveness (HateOffensive) | link |
| Emotion       | SemEval 2018                                                     | link |

Download the preprocessed versions of these datasets here!

token_avg

This directory contains a list of words for each style with their corresponding count (count), average attribution score (avg_attr), its standard deviation (std_attr), and average prediction probability (avg_pred); an avg_pred closer to 1 means the predicted label is more positive/polite/higher emotion, etc.
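
As an example, following the note above that avg_pred values closer to 1 correspond to more positive/polite/higher-emotion predictions, one could filter for frequent words that push predictions toward that end (file name, the "word" column, and the thresholds below are assumptions, not part of the released code):

    # Sketch: frequent words whose contexts are predicted as the "positive" class.
    import pandas as pd

    tok = pd.read_csv("token_avg/sentiment.tsv", sep="\t")
    frequent = tok[tok["count"] >= 5]
    cues = frequent[frequent["avg_pred"] > 0.8].sort_values("avg_attr", ascending=False)
    print(cues[["word", "count", "avg_attr", "avg_pred"]].head(20))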

Code

  • extract_tokens_from_bert_data.py: code for aggregating BERT's tokenized outputs and creating the [style]_features.tsv file for analysis.

model

  • training.py: code for training the model

    • Example command for running the code for joy emotion:

    python training.py --input_dir ../dataset/emotion_semeval/joy --task_name emotion --output_dir ../model/joy --temp_dir tmp_out/new_tmp_joy

  • captum_label.py: code for testing the model and obtaining word importance scores (attribution scores).

    • Example command for running the code for joy emotion:

    python captum_label.py --data_dir ../dataset/emotion_semeval/joy --model_type bert --emotion joy --do_eval --do_interpret --do_lower_case --model_name_or_path ../model/joy --output_dir ../model/joy --eval_dataset ../dataset/emotion_semeval/joy/dev.tsv

    • Example command for running the code for politeness:

    python captum_label.py --data_dir ../dataset/StanfordPoliteness/ --model_type bert --task StanfordPoliteness --do_eval --do_interpret --do_lower_case --model_name_or_path ../model/politeness --output_dir ../model/politeness --eval_dataset ../dataset/StanfordPoliteness/dev.tsv

BibTeX


@InProceedings{hayati-etal-2021hummingbird,
  author    = "Hayati, Shirley Anugrah and Kang, Dongyeop and Ungar, Lyle",
  title     = "Does BERT Learn as Humans Perceive? Understanding Linguistic Styles through Lexica",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  year      = "2021",
  publisher = "Association for Computational Linguistics",
  location  = "Punta Cana, Dominican Republic",
  url       = "https://arxiv.org/pdf/2109.02738.pdf"
}
