This repository contains data resources for the paper "Re-contextualizing Fairness in NLP: The Case of India", accepted to AACL-IJCNLP 2022.
This paper provides a holistic research agenda for re-contextualizing fairness research in the specific geo-cultural context of India. We also present empirical evidence that India-specific biases are present in NLP corpora and models. This data allows for the reproduction of our analysis of biases in corpora and models along the dimensions relevant to the Indian context.
The dataset contains tuples of the form (identity term, attribute), e.g., (gujarati, entrepreneur). Human raters annotated each tuple for whether the attribute is commonly associated with the identity term as a stereotype. The tuples were created with a combination of dictionary-driven (relying on previous literature for lists of characteristics and identity terms) and corpus-driven (filtering based on occurrence in IndicCorp-en) approaches. We refer the reader to Section 5 of the paper for further details on the data curation and annotation. We also retain individual annotations with anonymized annotator ids and self-identified gender and geographic region, following Prabhakaran et al. (2021). Along with the annotated tuples, we also release the list of identity terms and proxy identity terms (first names with prototypical gender associations, as obtained from Wikipedia) and the list of templates used to perform the analysis of NLP models in the paper.
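Because individual annotations are retained, one common first step is to aggregate them into a single label per tuple. The snippet below is a minimal sketch of a majority vote, not the aggregation used in the paper. The row layout (identity term, attribute, annotator id, label), the label strings, the annotator ids, and the function name majority_labels are all assumptions made for illustration; check the released files for the actual column names and label values.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (identity_term, attribute, annotator_id, label).
# The layout, label strings, and annotator ids are placeholders,
# not the repository's actual schema or data.
annotations = [
    ("gujarati", "entrepreneur", "a1", "stereotype"),
    ("gujarati", "entrepreneur", "a2", "stereotype"),
    ("gujarati", "entrepreneur", "a3", "not_stereotype"),
    ("identity_x", "attribute_y", "a1", "not_stereotype"),
    ("identity_x", "attribute_y", "a2", "not_stereotype"),
]


def majority_labels(rows):
    """Return {(identity, attribute): (majority_label, agreement)}.

    agreement is the fraction of annotators who chose the majority label.
    Ties are broken in favor of the label seen first for that tuple.
    """
    votes = defaultdict(Counter)
    for identity, attribute, _annotator, label in rows:
        votes[(identity, attribute)][label] += 1

    result = {}
    for pair, counts in votes.items():
        label, count = counts.most_common(1)[0]
        result[pair] = (label, count / sum(counts.values()))
    return result


if __name__ == "__main__":
    for pair, (label, agreement) in majority_labels(annotations).items():
        print(pair, label, round(agreement, 2))
```

Keeping the agreement fraction next to the majority label makes it easy to filter for tuples with strong annotator consensus, or to break results down by annotator gender or region if those fields are joined in.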