Skip to content

Helsinki-NLP/XED

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

XED

This is the XED dataset. The dataset consists of emotion annotated movie subtitles from OPUS. We use Plutchik's 8 core emotions to annotate. The data is multilabel. The original annotations have been sourced for mainly English and Finnish, with the rest created using annotation projection to aligned subtitles in 41 additional languages, with 31 languages included in the final dataset (more than 950 lines of annotated subtitle lines). The dataset is an ongoing project with forthcoming additions such as machine translated datasets. Please let us know if you find any errors or come across other issues with the datasets!

Citation

You can read more about XED in the following paper:

Öhman, E., Pàmies, M., Kajava, K. and Tiedemann, J., 2020. XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

@inproceedings{ohman2020xed,
  title={{XED}: A Multilingual Dataset for Sentiment Analysis and Emotion Detection},
  author={{\"O}hman, Emily and P{\`a}mies, Marc and Kajava, Kaisla and Tiedemann, J{\"o}rg},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}

Please cite this paper if you use the dataset.

Format

The files are formatted as follows:

sentence1\tlabel1,label2
sentence2\tlabel2,label3,label4...

Where the number indicates the emotion in ascending alphabetical order: anger:1, anticipation:2, disgust:3, fear:4, joy:5, sadness:6, surprise:7, trust:8, with neutral:0 where applicable. Note that if you use our BERT code, it will re-arrange the original labels when you use 1-8 into 0-7 by switching trust:8->0

Metadata can be found in the metadata file and the projection "pairs" files. Access to detailed metadata can be found on the OPUS website. We recommend the use of OPUS Tools. Compatible augmentation data by expert annotators can be found for a selection of languages in the following repos:

NB! The number of annotated subtitle lines are the same as listed in the original paper. The original paper gives the number of annotations, not lines with annotations which is the format of the files here.

Evaluations

We used BERT to test the annotations for Finnish, English, and a handful of other languages with complete BERT models.

English annotated data

Number of annotations: 24164 + 9384 neutral
Number of unique data points: 17530 + 6420 neutral
Number of emotions: 8 (+pos, neg, neu)
Number of annotators: 108 (63 active)
data f1 accuracy
English without NER, BERT 0.530 0.538
English with NER, BERT 0.536 0.544
English NER with neutral, BERT 0.467 0.529
English NER binary with surprise, BERT 0.679 0.765
English NER true binary, BERT 0.838 0.840
English NER, one-vs-rest Linear SVC 0.502 0.650-0.789 / class

Multilingual projections

And for the other languages with more than 950 lines using SVM:

LANG SIZE AVG_LEN ANGER ANTICIP. DISGUST FEAR JOY SADNESS SURPRISE TRUST 1label 2labels 3labels 4+labels F1_SVM
AR 3590 30.02 1012 839 478 565 561 536 615 589 65.01 26.94% 6.74% 1.31% 0.5729
BG 6974 41.3 1923 1630 891 1051 1174 1112 1166 1239 64.01 27.89% 6.62% 1.48% 0.6069
BR 12295 38.49 3228 2846 1641 1821 2128 2025 2121 2098 64.69 27.02% 6.66% 1.63% 0.6726
BS 2443 33.13 632 571 294 367 428 394 397 399 65.98 26.65% 6.47% 0.9% 0.5854
CN 1395 10.92 315 315 140 180 288 221 242 266 66.31 27.46% 5.16% 1.08% 0.5004
CS 6511 29.94 1728 1615 807 1035 1045 1011 1110 1091 64.64 27.42% 6.63% 1.31% 0.6263
DA 1838 31.03 447 472 193 218 350 282 294 351 66.59 26.17% 6.2% 1.03% 0.5989
DE 5503 50.24 1492 1304 742 790 938 889 905 904 64.96 27.11% 6.6% 1.33% 0.6059
EL 8083 35.22 2238 1956 1070 1162 1369 1273 1345 1367 64.25 27.58% 6.73% 1.45% 0.6192
ES 11303 35.69 3007 2631 1482 1765 1902 1810 1959 1924 64.52 27.22% 6.59% 1.66% 0.676
ET 1476 28.66 370 396 144 218 280 210 222 255 65.58 27.57% 6.17% 0.68% 0.5449
FI 8289 29.11 2175 2010 1014 1281 1503 1243 1383 1447 64.3 27.8% 6.38% 1.52% 0.5859
FR 7306 41.27 1946 1726 994 1127 1256 1200 1198 1259 63.63 28.02% 6.86% 1.49% 0.6257
HE 4449 28.97 1244 1078 551 658 791 681 754 783 63.34 28.37% 6.74% 1.55% 0.598
HR 5941 31.7 1494 1408 724 978 1029 947 991 1052 64.13 28.24% 6.26% 1.36% 0.6503
HU 5777 32.07 1539 1378 715 925 937 899 989 1028 64.19 27.77% 6.63% 1.42% 0.5978
IS 977 29.55 236 230 121 124 175 168 134 180 66.84 27.12% 5.32% 0.72% 0.5416
IT 6552 44.65 1783 1514 887 1092 1011 1122 1065 1104 63.58 28.4% 6.59% 1.42% 0.6907
MK 300 28.9 58 100 33 36 61 53 64 52 58.67 31.0% 9.67% 0.67% 0.4961
NL 5333 33.93 1392 1337 658 822 878 857 942 927 64.22 27.21% 6.86% 1.71% 0.614
NO 4257 31.1 1051 1029 500 584 822 678 731 712 65.09 27.93% 5.68% 1.29% 0.5771
PL 7179 32.44 1966 1707 964 1121 1206 1119 1199 1220 64.03 27.72% 6.69% 1.56% 0.6233
PT 7220 33.72 1890 1710 906 1101 1260 1210 1234 1257 63.85 27.87% 6.86% 1.43% 0.6203
RO 9474 36.88 2543 2181 1258 1433 1563 1568 1579 1608 64.9 27.07% 6.58% 1.45% 0.6387
RU 2377 32.45 564 590 268 423 376 395 416 405 64.7 27.6% 6.6% 1.09% 0.5976
SK 975 59.82 256 234 99 168 168 153 152 159 65.44 28.0% 5.54% 1.03% 0.5305
SL 2680 29.19 679 694 278 402 456 416 481 419 65.52 27.61% 5.6% 1.27% 0.6015
SR 8984 31.69 2365 2163 1131 1282 1652 1399 1519 1565 64.3 27.58% 6.72% 1.39% 0.6566
SV 4905 44.34 1273 1160 591 691 815 831 866 827 65.3 27.01% 6.48% 1.2% 0.6218
TR 9202 35.95 2423 2243 1212 1339 1610 1469 1589 1628 63.64 28.03% 6.71% 1.63% 0.608
VI 956 34.53 245 224 128 141 187 150 144 178 63.28 28.56% 7.11% 1.05% 0.5594

Related publications:

Some preliminary and related work has also been discussed in the following papers:

  • Öhman, E., Kajava, K., Tiedemann, J. and Honkela, T., 2018, October. Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (pp. 24-30).
  • Öhman, E.S. and Kajava, K.S., 2018. Sentimentator: Gamifying fine-grained sentiment annotation. Digital Humanities in the Nordic Countries 2018.
  • Kajava, K.S., Öhman, E.S., Hui, P. and Tiedemann, J., 2020. Emotion Preservation in Translation: Evaluating Datasets for Annotation Projection. In Digital Humanities in the Nordic Countries 2020. CEUR Workshop Proceedings.
  • Öhman, E., 2020. Challenges in Annotation: Annotator Experiences from a Crowdsourced Emotion Annotation Task. In Digital Humanities in the Nordic Countries 2020. CEUR Workshop Proceedings.

License: Creative Commons Attribution 4.0 International License (CC-BY)