Embetter: FACT-AI Word Embedding Debiasing

Are you using pre-trained word embeddings like word2vec, GloVe, or fastText? Embetter allows you to easily remove gender bias from them. No longer do nurses have to be female, nor are males the only cowards.

This repository was made during the FACT-AI course at the University of Amsterdam, in which papers from the FACT field are reproduced and possibly extended.

Getting Started

Here we outline how to set up and run the code.

Prerequisites

To run the code, create an Anaconda environment using:

```
conda env create -f environment.yml
```

or create an empty environment, install pip in it, and run:

```
pip install -r requirements.txt
```

This will install all dependencies for this package.

Installing

To use the code from this package, simply download or clone the repository.

Available pre-trained embeddings

Several pre-trained embeddings are provided; they are downloaded automatically when a WordEmbedding object is created with the name of one of the available embeddings. The embeddings are based on word2vec, GloVe, and fastText.

The available embeddings are listed below.

| Embedding name | Dimensionality | Vocabulary size |
|---|---|---|
| word2vec_large | 300 | 3M |
| word2vec_small | 300 | 26423 |
| word2vec_small_hard_debiased | 300 | 26423 |
| word2vec_small_soft_debiased | 300 | 26423 |
| glove_large | 300 | 1.9M |
| glove_small | 300 | 42982 |
| glove_small_hard_debiased | 300 | 42982 |
| glove_small_soft_debiased | 300 | 42982 |
| fasttext_large | 300 | 999994 |
| fasttext_small | 300 | 27014 |
| fasttext_small_hard_debiased | 300 | 27014 |
| fasttext_small_soft_debiased | 300 | 27014 |

Note that, because of the large computational workload, hard debiasing of the large embeddings (i.e. over the entire vocabulary) is very difficult, and soft debiasing is impossible with the current setup; debiased versions of the large embeddings are therefore not available for download.
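For example, one of the debiased embeddings from the table above can be loaded directly by its name; it is downloaded automatically on first use:

```python
from embetter.we import WordEmbedding

# Loads the hard-debiased small GloVe embedding by name.
glove_debiased = WordEmbedding('glove_small_hard_debiased')
```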

Embedding from file

Besides the available embeddings, it is also possible to load an embedding from a file on your device. If the embedding is very large and loading it takes unreasonably long, make sure the filename ends with "large" (possibly followed by a file extension), e.g. ./myfolder/myembedding_large.txt. This ensures that the embedding is loaded with the gensim package, which scales better to large embeddings. In that case, however, the first line of the file must contain the number of words in the embedding, followed by a space, followed by the embedding dimensionality; for glove_large, for example, this line is 1900000 300.

Apart from this header line for large embeddings, each line in the file must start with a word from the vocabulary, followed by its vector values, separated by whitespace characters. Binary (.bin) files are also supported for the available word2vec binary embedding files, but support for manually created binary files is not guaranteed and should not be relied upon.
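As an illustration of this format, here is a minimal sketch that writes a toy embedding to disk in the expected layout; the vocabulary, vectors, and filename are made up for the example:

```python
import numpy as np

# Hypothetical toy vocabulary with random 300-dimensional vectors.
words = ["nurse", "doctor", "cowboy"]
vectors = np.random.rand(len(words), 300)

with open("myembedding_large.txt", "w") as f:
    # Header line, required only when the filename ends in "large":
    # vocabulary size, a space, then the dimensionality.
    f.write(f"{len(words)} {vectors.shape[1]}\n")
    for word, vec in zip(words, vectors):
        # Each subsequent line: the word, then its vector values,
        # separated by whitespace.
        f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```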

Example

```python
from embetter.we import WordEmbedding

# Load one of the provided pre-trained embeddings (downloaded automatically).
w2v = WordEmbedding('word2vec_small')

# Load an embedding from a local file on your device.
my_embed = WordEmbedding('./myfolder/myembedding.txt')
```
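To give an intuition for what hard debiasing does under the hood, here is a minimal numpy sketch of the neutralize step from Bolukbasi et al. (see Acknowledgements). It uses made-up 4-dimensional toy vectors and is not the package API:

```python
import numpy as np

# Toy vectors standing in for real 300-dimensional embeddings.
he = np.array([0.5, 0.1, 0.3, 0.2])
she = np.array([0.1, 0.5, 0.3, 0.2])
nurse = np.array([0.2, 0.4, 0.6, 0.1])

# Gender direction: the normalized difference of a definitional pair.
g = he - she
g /= np.linalg.norm(g)

# Neutralize: remove the component of a gender-neutral word along g,
# then renormalize.
nurse_debiased = nurse - np.dot(nurse, g) * g
nurse_debiased /= np.linalg.norm(nurse_debiased)

print(np.dot(nurse, g))           # nonzero projection before debiasing
print(np.dot(nurse_debiased, g))  # ~0 after debiasing
```

The full method also equalizes definitional pairs such as he/she so that they remain equidistant from the gender-neutral words; see the paper for details.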

Running the experiments

A general tutorial on the usage of this package, together with some of the experiments, is available in the Jupyter notebook in the repository.
For a full replication of all the available experiments, you can execute the experiments.py script.
Use

```
python experiments.py -h
```

in your terminal to get a list of options for which experiments to run.

NOTE: Creating analogies is a very costly procedure that requires comparisons between a large number of word pairs. For larger word embeddings, this quickly results in memory issues; even for glove_small, which contains 42982 words, creating analogies may prove problematic. To run the benchmarks without generating analogies, use the --no_show flag, which skips both analogy creation and the profession analysis:

```
python experiments.py --no_show
```

Authors

Kylian van Geijtenbeek
Thom Visser
Martine Toering
Iulia Ionescu

Morris Frank - TA

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

Acknowledgements

The debiasing methods that we provide were proposed in:

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai
2016 (https://arxiv.org/abs/1607.06520)

and full credit for these methods goes to the authors.

The code that we provide is largely based on their code, which is available in their GitHub repository: https://github.com/tolga-b/debiaswe.

Additionally, some of the code borrows from:
