A useful repository for calculating classification baselines using Bert


Document or sequence classification via Bert

This is a constantly evolving document. It reports baselines for various datasets used in my NLP research and related projects. Unless otherwise stated, no hyperparameter optimization was carried out to obtain these results.


BSC: Bert for Sequence classification (class transformers.BertForSequenceClassification)

Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output)


BWA: Bert for Sequence classification with word attention

Bert Model transformer with a sequence classification head on top (a layer with word attention on the tokens of the sequence (CLS included))

Implementation of section "2.2 Hierarchical Attention > Word Attention" of Hierarchical Attention Networks for Document Classification.

Adaptation of the class transformers.BertForSequenceClassification.
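As a rough illustration of the word-attention pooling referred to above, here is a minimal pure-Python sketch of the equations from that section of the paper: u_t = tanh(W h_t + b), alpha_t = softmax(u_t · u_w), s = sum_t alpha_t h_t. The function name `word_attention`, the plain-list representation and the toy dimensions are assumptions made for illustration; the head in this repository presumably operates on Bert's token outputs as a PyTorch module and may differ in its details.

```python
import math

def word_attention(hidden, W, b, u_w):
    """Pool token vectors with HAN-style word attention.

    hidden: list of T token vectors of size d (e.g. Bert outputs, CLS included)
    W, b:   projection matrix (a x d) and bias (a)
    u_w:    learned context vector (a)
    Returns (attention weights, pooled vector).
    """
    scores = []
    for h in hidden:
        # u_t = tanh(W h_t + b)
        u_t = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
               for row, b_i in zip(W, b)]
        # unnormalized score: u_t . u_w
        scores.append(sum(u * c for u, c in zip(u_t, u_w)))
    # softmax over tokens, shifted by the max for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # pooled representation: s = sum_t alpha_t h_t
    d = len(hidden[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, hidden)) for j in range(d)]
    return alphas, pooled

# Toy example: 3 tokens, hidden size 2, attention size 2 (all values hypothetical)
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u_w = [1.0, 1.0]
alphas, pooled = word_attention(hidden, W, b, u_w)
```

The pooled vector would then be fed to a linear classification layer, in place of the pooled CLS output used by BSC.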




Multilabel datasets:

| Dataset name | # total | # train | # val. | # test | max. (& avg.) depth | # labels | Type of classification |
|---|---|---|---|---|---|---|---|
| WoS (Web of Science) | 46,985 | 30,070 | 7,518 | 9,397 | 2 (2.0) | 141 | Article classification by topic |
| wikivitals-lvl4 | 10,011 | 6,407 | 1,602 | 2,003 | 3 (--) | 587 | Article classification by topic |
| wikivitals-lvl5 | | | | | | | Article classification by topic |
| wikivitals-lvl5 (my version) | | | | | | | Article classification by topic |

WikiVitals (level 4)

Description (from NetSet): Vital articles of Wikipedia in English (level 4) with [...] words used in summaries (tokenization by Spacy, model "en_core_web_lg").

  • Associated task: classification (single-label classification, multilabel classification)
    • classification of the articles according to their topic. Each article has one or more labels that correspond to a unique path in a hierarchy of labels
  • Domain: Research / Education
  • Type: Real
  • Instance count: 10,011
  • Data types: String, Numeric
  • Missing values: No
  • Dataset infos and download: NetSet - WikiVitals (en) (texts not available)
  • Source: Wikivitals Level 4 (the source has changed since the dataset creation in June 2021)


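| Model | max. # tokens | micro-F1 | macro-F1 | config. id | Comments |
|---|---|---|---|---|---|
| BSC | 16 | 72.69 | 19.48 | wikivitals | |
| BSC | 128 | 85.74 | 37.36 | wikivitals | |
| BSC | 512 | 85.99 | 34.40 | wikivitals | |
| BWA | 16 | | | wv4_16_BWA | |
| BWA | 128 | 85.94 | 36.18 | wv4_128_BWA | |
| BWA | 512 | 87.16 | 37.72 | wv4_512_BWA | |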
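The large gap between micro-F1 and macro-F1 in these results is what one would expect when many labels are rare: micro-F1 pools the counts over all labels, whereas macro-F1 gives every label the same weight. The sketch below assumes the standard definitions, with predictions given as sets of labels; the helper names are hypothetical, and the evaluation code in this repository may treat edge cases (such as labels with no support) differently.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    """y_true, y_pred: lists of label sets, one set per document."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for true, pred in zip(y_true, y_pred):
        for label in labels:
            if label in pred and label in true:
                counts[label][0] += 1
            elif label in pred:
                counts[label][1] += 1
            elif label in true:
                counts[label][2] += 1
    # micro: pool the counts over all labels, then compute a single F1
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    # macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro

# Toy data: a frequent label predicted well, a rare label never predicted
y_true = [{"science"}, {"science"}, {"science"}, {"arts"}]
y_pred = [{"science"}, {"science"}, {"science"}, {"science"}]
micro, macro = micro_macro_f1(y_true, y_pred, ["science", "arts"])
# micro = 0.75, macro = 3/7 (about 0.43)
```

In this toy case the missed rare label costs macro-F1 half of its score, while micro-F1 only drops to 0.75.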

wikiVitals-lvl5-04-2022 (our own)

Description: Vital articles of Wikipedia in English (level 5) with words used in summaries.

  • Associated task: classification (single-label classification, multilabel classification)
    • classification of the articles according to their topic. Each article has 3 labels that correspond to a unique path in a hierarchy of labels
  • Domain: Research / Education
  • Type: Real
  • Instance count: 48,512
  • Data types: String, Numeric
  • Missing values: No
  • Dataset infos and download: my Github repo
  • Source: complete dump from April, 2022



Train/validation/test split: 81%/9%/10%, obtained by stratified sampling.
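A stratified 81%/9%/10% split can be sketched as follows. This is only an illustration under assumptions: the function name, the seed, and the choice of a single label per instance as the stratification key are hypothetical, since the text does not say which label level was used for stratification.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.09, test_frac=0.10, seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        # shuffle within each class, then slice the class into three parts
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test.extend(indices[:n_test])
        val.extend(indices[n_test:n_test + n_val])
        train.extend(indices[n_test + n_val:])
    return train, val, test

# Toy data: two classes of unequal size
labels = ["a"] * 100 + ["b"] * 200
train, val, test = stratified_split(labels)
```

With very small classes, rounding can leave the validation or test part of a class empty, so rare labels may need special handling.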

Level 0 (11 classes)

| Model | max. # tokens | Accuracy | config. id | Comments |
|---|---|---|---|---|
| BSC | 128 | 95.83 | wv-lvl5-04-2022_128_BSC_label0 | |
| BSC | 512 | 95.17 | wv-lvl5-04-2022_512_BSC_label0 | |
| BWA | 128 | 95.57 | wv-lvl5-04-2022_128_BWA_label0 | |

Level 1 (32 classes)

| Model | max. # tokens | Accuracy | config. id | Comments |
|---|---|---|---|---|
| BSC | 128 | 89.42 | wv-lvl5-04-2022_128_BSC_label1 | |
| BSC | 512 | | | |
| GMNN w/ FAGCN | -- | 87.92 (0.31) | | using 0/1 valued representations |

Level 2 (251 classes)

| Method | max. # tokens | Accuracy | config. id | Comments |
|---|---|---|---|---|
| BSC | 128 | | | |
| BSC | 512 | | | |

Web of Science


  • Associated task: classification (single-label classification, multilabel classification)
    • classification of the articles according to their topic. Each article has 2 labels that correspond to a unique path in a hierarchy of labels
  • Domain: Research / Education
  • Type: Real
  • Instance count: 46,985
  • Data types: String, Numeric
  • Missing values: No
  • Dataset infos and download: to be completed


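| Method | max. # tokens | micro-F1 | macro-F1 | config. id | Comments |
|---|---|---|---|---|---|
| BSC | 512 | 85.51 | 78.10 | wos | |
| BSC | 512 | 86.33 | 76.77 | wos | |
| BWA | 512 | | | wos | |
| Wang et al. (2022) | 512 | 85.63 | 79.07 | wos | |
| Chen et al. (2021) | 512 | 86.26 | 80.58 | wos | |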


EEEC - Enriched Equity Evaluation Corpus

Description: EEC (Equity Evaluation Corpus) (Kiritchenko and Mohammad 2018) is a benchmark data set, designed for examining inappropriate biases in system predictions, and it consists of 8,640 English sentences chosen to tease out Racial and Gender related bias. Each sentence is labeled for the mood state it conveys, a task also known as Profile of Mood States (POMS). Each of the sentences in the data set is composed using one of eleven templates, with placeholders for a person’s name and the emotion it conveys. Designed as a bias detection benchmark, the sentences in EEC are very concise, which can make them not useful as training examples. If a classifier sees in training only a small number of examples, which differ only by the name of the person and the emotion word, it could easily memorize a mapping between emotion words and labels, and will not learn anything else.

To solve this and create a more representative and natural data set for training, we expand the EEC data set, creating an enriched data set which we denote as Enriched Equity Evaluation Corpus, or EEEC. In this data set, we use the 11 templates of EEC and randomly add a prefix or suffix phrase, which can describe a related place, family member, time, and day, including also the corresponding pronouns to the Gender of the person being discussed. We also create 13 non-informative sentences, and concatenate them before or after the template such that there is a correlation between each label and three of those sentences. This is performed so that we have other information that could be valuable for the classifier other than the person’s name and the emotion word. Also, to further prevent memorization, we include emotion words that are ambiguous and can describe multiple mood states. Our enriched data set consists of 33,738 sentences generated by 42 templates that are longer and much more diverse than the templates used in the original EEC.
While still synthetic and somewhat unrealistic, our data set has much longer sentences, has more features that are predictive of the label, and is harder for the classifier to memorize.

  • Associated task: classification (single-label classification)
    • classification of the sentences according to the 'gender', 'race' or 'POMS' (profile of mood states)
  • Domain: Research / Education
  • Type: Synthetic
  • Instance count: 33,738 sentences (according to the paper)
  • Data types: String, Numeric
  • Missing values: ~ (race attributed randomly when missing in the 'gender treatment' splits)
  • Dataset infos and download: CausaLM repository

Introduced in: Feder, A., Oved, N., Shalit, U., & Reichart, R. (2021). CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2), 333-386.

Personal notes:

The splits provided by the authors of the "CausaLM" article contain pairs of 'factual' and 'counterfactual' examples. For evaluating a model's ability to predict 'gender', 'race' or mood state ('POMS'), this notion of pairs is unnecessary. So, for each split in the dataset, I collected the unique instances it contains, an instance being either a factual example or a counterfactual example in the data provided. Below are the statistics for the distribution of these unique instances in the different splits and the 'overlaps' between the splits (i.e. the number of unique instances that appear in both splits of a pair).

Gender as a treatment:

Total number of unique observations: 30,055 unique sentences

Number of unique observations per split and overlap with the other splits:

  • train: 25,169 unique sentences (overlap w/ validation: 6,796, w/ test: 8,184)
  • validation: 9,505 unique sentences (overlap w/ train: 6,796, w/ test: 3,157)
  • test: 11,422 unique sentences (overlap w/ train: 8,184, w/ validation: 3,157)
  • overlap between 'train + validation' and 'test': 9,245 (~81% of the test set)

For evaluation, I build a train/validation/test split that has no overlap between the different splits of the dataset, with the following characteristics:

| Total number of instances | #train | #validation | #test | Comments |
|---|---|---|---|---|
| 30,055 | 25,169 | 2,709 | 2,177 | |
| | 83.74% | 9.01% | 7.24% | |
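The split sizes are consistent with keeping the train split as is and removing from validation and test every sentence already seen in an earlier split (9,505 - 6,796 = 2,709 and 11,422 - 9,245 = 2,177). The sketch below shows that procedure with Python sets; it is inferred from the counts rather than taken from the repository code, and the function names and toy sentences are hypothetical.

```python
def overlap(a, b):
    """Number of unique sentences shared by two splits."""
    return len(set(a) & set(b))

def remove_overlap(train, validation, test):
    """Deduplicate each split, then drop from validation and test
    anything that already appears in an earlier split."""
    train_set = set(train)
    val_set = set(validation) - train_set
    test_set = set(test) - train_set - set(validation)
    return train_set, val_set, test_set

# Toy data: sentences stand in for the factual/counterfactual examples
train = ["s1", "s2", "s3", "s3"]
validation = ["s3", "s4"]
test = ["s2", "s4", "s5"]

new_train, new_val, new_test = remove_overlap(train, validation, test)
# new_train == {"s1", "s2", "s3"}, new_val == {"s4"}, new_test == {"s5"}
```

Keeping the train split intact preserves the largest possible training set, at the cost of smaller validation and test sets.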

POMS distribution in sets (train, validation, test):

  • anger: 22.38% - 23.07% - 21.68%
  • fear: 23.25% - 23.48% - 25.17%
  • joy: 23.42% - 24.10% - 23.89%
  • sadness: 23.43% - 23.77% - 22.88%
  • neutral: 7.51% - 5.57% - 6.38%

Race_label distribution in sets (train, validation, test):

  • African-American: 49.97% - 52.05% - 51.17%
  • European: 50.03% - 47.95% - 48.83%

Gender distribution in sets (train, validation, test):

  • male: 49.97% - 49.72% - 50.53%
  • female: 50.03% - 50.28% - 49.47%


Gender treatment POMS

| Method | max. # tokens | Accuracy | config. id | Comments |
|---|---|---|---|---|
| BSC | 128 | | | |
| BWA | 64 | 94.35 | EEEC-gender_64_BWA_POMS | |


To train (or evaluate) a model, run the following command:

# Training of a model using the configuration file named config_{dataset_id}.yml
> python --c dataset_id
# Evaluation of a model using the configuration file named config_{dataset_id}.yml
> python --c dataset_id --evaluate_only True

Training steps: Training can be performed in one or two steps. When both are enabled, they run in this order:

  1. Training of the classification head only (the Bert model is frozen)
  2. Training of the Bert model and the classification head

Each step is optional. To deactivate a training step, set the parameter 'do_train' to False in the corresponding training configuration (in the configuration file).

Note on evaluation: To evaluate a model on the train and validation sets, either pass the --evaluate_only argument or deactivate both training steps in the configuration file.
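The interaction between the two training steps and evaluation-only mode can be summarized with a small sketch. Only the 'do_train' parameter and the --evaluate_only argument come from this README; the configuration keys `train_head_only` and `train_full_model`, the dictionary layout and the function name are hypothetical stand-ins for whatever the actual YAML configuration files contain.

```python
def plan_training(config, evaluate_only=False):
    """Return, in order, a description of each training step that would run."""
    if evaluate_only:
        return []
    steps = [
        # (hypothetical config key, what is trained during that step)
        ("train_head_only", "classification head (Bert frozen)"),
        ("train_full_model", "Bert + classification head"),
    ]
    return [description for key, description in steps
            if config.get(key, {}).get("do_train", False)]

# Hypothetical configuration: only the first step is enabled
config = {
    "train_head_only": {"do_train": True},
    "train_full_model": {"do_train": False},
}
plan = plan_training(config)
```

An empty plan, obtained either with `evaluate_only=True` or with both 'do_train' values set to False, corresponds to a run that only evaluates the model.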

Parameters for training: See this