google-research-datasets/nlp-fairness-for-india

This repository contains data resources for the paper "Re-contextualizing Fairness in NLP: The Case of India", accepted to AACL-IJCNLP 2022.

The paper provides a holistic research agenda for re-contextualizing fairness research in the specific geo-cultural context of India. It also presents empirical evidence that India-specific biases are present in NLP corpora and models. The data released here allows others to reproduce our analysis of biases in corpora and models along the dimensions relevant to the Indian context.

The dataset contains tuples of the form (identity term, attribute), e.g. (gujarati, entrepreneur). Human raters annotated each tuple for whether the attribute is commonly associated with the identity term as a stereotype. The tuples were created with a combination of dictionary-driven (relying on previous literature for lists of characteristics and identity terms) and corpus-driven (filtering based on occurrence in IndicCorp-en) approaches. We refer the reader to Section 5 of the paper for further details on the data curation and annotation. We also retain individual annotations with anonymized annotator ids and self-identified gender and geographic region, following Prabhakaran et al., 2021. Along with the annotated tuples, we release the list of identity terms and proxy identity terms (first names with prototypical gender associations as obtained from Wikipedia) and the list of templates used to perform the analysis of NLP models in the paper.
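Because individual annotations are retained, a per-tuple summary has to be computed by the user. The following is a minimal Python sketch of one way to do that, not the aggregation used in the paper. The field names (`identity_term`, `attribute`, `annotator_id`, `is_stereotype`), the boolean label, and the sample records are all assumptions made for illustration; check the released files for the actual column names and label values.

```python
from collections import defaultdict

# Hypothetical records: field names and label values are assumptions,
# not the schema of the released files.
annotations = [
    {"identity_term": "gujarati", "attribute": "entrepreneur",
     "annotator_id": "a1", "is_stereotype": True},
    {"identity_term": "gujarati", "attribute": "entrepreneur",
     "annotator_id": "a2", "is_stereotype": True},
    {"identity_term": "gujarati", "attribute": "entrepreneur",
     "annotator_id": "a3", "is_stereotype": False},
    {"identity_term": "identity_x", "attribute": "attribute_y",
     "annotator_id": "a1", "is_stereotype": False},
]


def stereotype_rate(records):
    """Return, for each (identity term, attribute) tuple, the fraction of
    annotators who marked the association as a stereotype."""
    votes = defaultdict(list)
    for record in records:
        key = (record["identity_term"], record["attribute"])
        votes[key].append(bool(record["is_stereotype"]))
    return {key: sum(labels) / len(labels) for key, labels in votes.items()}


rates = stereotype_rate(annotations)
for (identity_term, attribute), rate in sorted(rates.items()):
    print(f"({identity_term}, {attribute}): {rate:.2f}")
```

Keeping the fraction rather than a hard majority label preserves annotator disagreement, which can then be thresholded or broken down by annotator gender or region as needed.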
