Code for "Astraea: Grammar-based Fairness Testing". In this repository, we present code for fairness of three NLP tasks, Coreference Resolution, Sentiment Analysis and Masked Language Modeling
The code for the fairness testing of Coreference Resolution (coref) can be found in the Coreference-Resolution folder. We test three coref NLP algorithms: the deep-learning-based NeuralCoref and AllenNLP, and the rule-based Stanford CoreNLP.
Please see the respective pages for detailed installation instructions.
For each coreference resolution module, we have two grammar variants: the ambiguous and the unambiguous grammar. The ambiguous variant tests for fairness related to occupation, gender and religion, whereas the unambiguous variant tests only for fairness with respect to gender. Additionally, for easy reproducibility and verification, we provide all the generated pickles and tokens. The analysis scripts can be run via the Data-Analysis files for each grammar variant.
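For concreteness, here is a minimal sketch of the pairwise fairness check for coref, shown for NeuralCoref; the sentence template below is an illustrative assumption, not one of the released grammars:

```python
import spacy
import neuralcoref  # NeuralCoref requires a compatible spaCy 2.x model

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

# Illustrative template with a single sensitive slot (the gendered pronoun).
TEMPLATE = "The developer argued with the designer because {pronoun} disliked the layout."

def antecedent(sentence, pronoun):
    """Return the main mention of the cluster containing the pronoun, if any."""
    doc = nlp(sentence)
    for cluster in doc._.coref_clusters:
        if any(m.text.lower() == pronoun for m in cluster.mentions):
            return cluster.main.text
    return None

# Two inputs that differ only in the sensitive attribute should resolve alike;
# a differing antecedent is flagged as an (individual) fairness violation.
a, b = TEMPLATE.format(pronoun="he"), TEMPLATE.format(pronoun="she")
if antecedent(a, "he") != antecedent(b, "she"):
    print("Fairness violation:", a, "/", b)
```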
We evaluate 11 (6 pre-trained, 5 self-trained) sentiment analysis models. The self-trained models can be found in the folder Sentiment Analysis/trained-sentiment-analyzers. Please refer to the paper for details of the self-trained models.
Astraea evaluates the following pre-trained sentiment analysis models:
- TextBlob (PatternAnalyzer)
- TextBlob (NaiveBayesAnalyzer)
- NLTK-Vader
- Vader Sentiment
- Google NLP
- Stanford CoreNLP
Please refer to the specific page for installation instructions.
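As an illustration, here is a minimal sketch of the pairwise fairness check for sentiment analysis, shown for two of the analysers listed above (TextBlob's pattern analyser and VADER); the template and attribute values are illustrative assumptions:

```python
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

def label(score):
    """Discretise a raw polarity/compound score into a sentiment class."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Illustrative template; the sensitive slot is the gendered noun.
TEMPLATE = "The {attr} was delighted with the excellent service."
pair = (TEMPLATE.format(attr="man"), TEMPLATE.format(attr="woman"))

analysers = {
    "TextBlob (pattern)": lambda s: TextBlob(s).sentiment.polarity,
    "VADER": lambda s: vader.polarity_scores(s)["compound"],
}
for name, predict in analysers.items():
    # A pair differing only in the sensitive attribute must get the same label.
    if label(predict(pair[0])) != label(predict(pair[1])):
        print(f"{name}: fairness violation on {pair}")
```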
We evaluated bert-cased, bert-uncased, distilbert-cased and distilbert-uncased. Please refer to the Hugging Face page for further documentation. These models must be stored in the Masked-Language-Modelling/models folder.
As with the other cases, we provide the tokens and the errors in pickle files for easy reproduction. These are stored in the folder Masked-Language-Modelling/saved_pickles
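A minimal sketch of the corresponding check for masked language modelling, using the Hugging Face transformers fill-mask pipeline; the model identifier and templates below are illustrative assumptions (in this repository, the models would be loaded from the Masked-Language-Modelling/models folder):

```python
from transformers import pipeline

# "bert-base-uncased" is an illustrative identifier; point `model` at a local
# path to use the models stored in Masked-Language-Modelling/models instead.
fill = pipeline("fill-mask", model="bert-base-uncased")

def top_prediction(sentence):
    """Return the highest-ranked token predicted for the [MASK] slot."""
    return fill(sentence)[0]["token_str"]

# Two contexts differing only in the sensitive attribute (gendered noun).
a = "The man worked as a [MASK]."
b = "The woman worked as a [MASK]."

if top_prediction(a) != top_prediction(b):
    print("Fairness violation:", top_prediction(a), "vs.", top_prediction(b))
```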
Astraea is a two-phase approach. Given an NLP model f, an input grammar and the sensitive attributes in that grammar, Astraea first randomly (RAND) explores the grammar's production rules to generate a large number of input sentences. For any two sentences a and b that differ only in the sensitive attributes, Astraea flags an (individual) fairness violation when f(a) differs from f(b). In the second phase (PROB), Astraea analyses the fairness violations discovered in the first phase and isolates the input features (e.g. the specific occupation or gender) that are predominantly responsible for them; these input features are then prioritized when generating further tests.
The goal is to direct the test generation process and steer the model execution towards a higher density of fairness violations.
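The following toy sketch illustrates the two phases end-to-end; the grammar, the model stub and the simple violation-count weighting are illustrative assumptions, not the repository's exact implementation:

```python
import random
from collections import Counter

# Toy grammar: one ordinary feature (occupation) and one sensitive
# attribute (gendered pronoun).
OCCUPATIONS = ["doctor", "nurse", "CEO", "teacher"]
PRONOUNS = ["he", "she"]

def sentence(occupation, pronoun):
    return f"The {occupation} smiled because {pronoun} passed the exam."

def model(text):
    """Stand-in for the NLP model f; replace with a real analyser."""
    return "negative" if "nurse" in text and "she" in text else "positive"

def violating_features(samples):
    """Fairness oracle: a pair differing only in the sensitive attribute
    must receive the same output; otherwise record the ordinary feature."""
    counts = Counter()
    for occ in samples:
        outputs = {model(sentence(occ, p)) for p in PRONOUNS}
        if len(outputs) > 1:  # f(a) != f(b): fairness violation
            counts[occ] += 1
    return counts

# Phase 1 (RAND): uniform random choices over the grammar's productions.
rand_counts = violating_features(random.choices(OCCUPATIONS, k=1000))

# Phase 2 (PROB): bias generation towards error-inducing features.
weights = [1 + rand_counts[o] for o in OCCUPATIONS]
prob_counts = violating_features(random.choices(OCCUPATIONS, weights=weights, k=1000))

print("RAND violations:", sum(rand_counts.values()))
print("PROB violations:", sum(prob_counts.values()))
```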
In our current Python implementation, the files implementing the RAND method are marked Exploration, while the files implementing the PROB method carry the tag Exploitation.
Please refer to the paper for additional details.
We examine whether Astraea's bias mitigation of error-inducing input tokens generalises to unseen input sentences, in particular sentences in the wild that contain previously error-inducing tokens. For instance, if Astraea identified the token "CEO" as the most error-inducing token in a sentiment analyser, we check whether other in-the-wild sentences containing the token "CEO" still lead to fairness violations in the re-trained models. To this end, we collected the five (5) and ten (10) topmost error-inducing input tokens identified by Astraea; for example, we chose the top five or ten most biased (fe)male occupations from our sentiment analysis experiments. Then, using sentences provided by a different dataset (Winogender), we substituted these error-inducing tokens into the sentences and tested them on both the original and the re-trained models. Astraea's bias mitigation generalises to unseen input sentences containing the error-inducing input tokens. The models can be found here.
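A minimal sketch of this check follows; the token list, the sentence template and the two model stubs are illustrative stand-ins for the released models:

```python
# Illustrative top-5 error-inducing tokens and an illustrative template with
# an occupation slot and a sensitive-attribute slot (e.g. adapted from a
# Winogender-style sentence).
TOP_TOKENS = ["CEO", "nurse", "surgeon", "librarian", "engineer"]
TEMPLATE = "The {occ} told the customer that {pronoun} could help tomorrow."

def violation_rate(predict):
    """Fraction of tokens whose he/she sentence pair flips the model output."""
    flips = sum(
        predict(TEMPLATE.format(occ=occ, pronoun="he"))
        != predict(TEMPLATE.format(occ=occ, pronoun="she"))
        for occ in TOP_TOKENS
    )
    return flips / len(TOP_TOKENS)

def original(text):   # stand-in for the pre-mitigation model
    return "negative" if "nurse" in text and "she" in text else "positive"

def retrained(text):  # stand-in for the bias-mitigated, re-trained model
    return "positive"

print("original:", violation_rate(original), "re-trained:", violation_rate(retrained))
```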
Please refer to the paper for additional details.
We employ grammar coverage as a test adequacy criterion for Astraea, as it is the most practical metric in a black-box setting.
To report test coverage, we measure (i) total terminal-symbol coverage with respect to the input grammar, and (ii) the number of covered pairs of terminal symbols, relative to all pairs of terminal symbols associated with the sensitive attribute.
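A minimal sketch of how these two metrics can be computed, under the illustrative assumption that a pair counts as covered when both terminals co-occur in a generated input and at least one of them belongs to the sensitive attribute:

```python
from itertools import combinations

# Illustrative grammar terminals; "he"/"she" form the sensitive attribute.
TERMINALS = {"doctor", "nurse", "CEO", "he", "she"}
SENSITIVE = {"he", "she"}

generated = [
    "The doctor smiled because he passed the exam.",
    "The nurse smiled because she passed the exam.",
]

def terminals_in(sentence):
    """Terminal symbols occurring in one generated input."""
    return TERMINALS & set(sentence.rstrip(".").split())

# Metric (i): total terminal-symbol coverage w.r.t. the input grammar.
covered = set().union(*(terminals_in(s) for s in generated))
terminal_cov = len(covered) / len(TERMINALS)

# Metric (ii): covered pairs of terminals, relative to all pairs in which
# at least one member belongs to the sensitive attribute.
target_pairs = {p for p in combinations(sorted(TERMINALS), 2) if SENSITIVE & set(p)}
covered_pairs = set()
for s in generated:
    covered_pairs |= set(combinations(sorted(terminals_in(s)), 2))
covered_pairs &= target_pairs
pair_cov = len(covered_pairs) / len(target_pairs)

print(f"terminal coverage: {terminal_cov:.0%}, pair coverage: {pair_cov:.0%}")
```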
Please refer to Table 8 in the paper for additional details.
This folder contains the code and data used to evaluate the syntactic and semantic validity of the input grammar used by Astraea. We evaluate the correctness of our input grammar by examining the syntactic and semantic validity of the generated input sentences. First, we employ Grammarly to evaluate the syntactic validity of all generated inputs; we show that almost all (97.4%) of Astraea's generated inputs are syntactically valid. We also conducted a user study with 205 participants to evaluate the semantic validity of Astraea's generated inputs, especially in comparison to the semantic validity of human-written input sentences. Our results show that Astraea's generated input sentences are 81% as semantically valid as human-written ones.
This repository is still under development. Please email Sakshi Udeshi (sakshi_udeshi@mymail.sutd.edu.sg), Ezekiel Soremekun (ezekiel.soremekun@uni.lu) or Sudipta Chattopadhyay (sudipta_chattopadhyay@sutd.edu.sg) for any questions.
@article{astraea,
  title={Astraea: Grammar-based Fairness Testing},
  author={Ezekiel Soremekun and Sakshi Udeshi and Sudipta Chattopadhyay},
  journal={IEEE Transactions on Software Engineering (TSE)},
  year={2022}
}