SWOW-eval

This project describes a new task for evaluating pre-trained word embeddings. In particular, we present an intrinsic evaluation task, SWOW-8500, which employs a large word association dataset called the Small World of Words (SWOW).

This repository also serves as the go-to page for our paper titled SWOW-8500: Word Association Task for Intrinsic Evaluation of Word Embeddings, accepted at the RepEval 2019 workshop, co-located with the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019), held June 2–7, 2019, in Minneapolis, United States.

Contributors

Avijit Thawani: BTech and MTech in Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi; PhD student, Department of Computer Science, Viterbi School of Engineering, University of Southern California.
Contact: avijit.thawani.cse14@iitbhu.ac.in

Biplav Srivastava: Distinguished Data Scientist and Master Inventor, IBM, New York.

Anil Kumar Singh: Associate Professor, Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi.

Index

  1. Word Associations
  2. Word Embeddings
  3. Intrinsic and Extrinsic Evaluation
  4. How to Run
  5. Results
  6. Further Reading
  7. License

Citation

Please consider citing us if you found our project useful:

@inproceedings{thawani-etal-2019-swow,
    title = "{SWOW}-8500: Word Association task for Intrinsic Evaluation of Word Embeddings",
    author = "Thawani, Avijit  and
      Srivastava, Biplav  and
      Singh, Anil",
    editor = "Rogers, Anna  and
      Drozd, Aleksandr  and
      Rumshisky, Anna  and
      Goldberg, Yoav",
    booktitle = "Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for {NLP}",
    month = jun,
    year = "2019",
    address = "Minneapolis, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-2006",
    doi = "10.18653/v1/W19-2006",
    pages = "43--51",
    abstract = "Downstream evaluation of pretrained word embeddings is expensive, more so for tasks where current state of the art models are very large architectures. Intrinsic evaluation using word similarity or analogy datasets, on the other hand, suffers from several disadvantages. We propose a novel intrinsic evaluation task employing large word association datasets (particularly the Small World of Words dataset). We observe correlations not just between performances on SWOW-8500 and previously proposed intrinsic tasks of word similarity prediction, but also with downstream tasks (eg. Text Classification and Natural Language Inference). Most importantly, we report better confidence intervals for scores on our word association task, with no fall in correlation with downstream performance.",
}

Word Associations

Word association games are those in which a participant is asked to utter the first word (or first few words) that comes to mind when given a trigger/cue/stimulus word. For example, given the cue KING, one could respond with RULE, QUEEN, KINGDOM, or even KONG (from the film King Kong). Word associations have long intrigued psychologists, including Carl Jung, and several large studies have been conducted to collect them. Some prominent datasets that collect participant responses to word association games are listed below:

  • USF-FA: The University of South Florida Free Association norms contain single-word association responses from an average of 149 participants per cue for a set of 5,019 cue words.
  • EAT: The Edinburgh Associative Thesaurus collects 100 responses per cue for a total of 8,400 cues.
  • JeuxDeMots: a crowdsourced game that has collected over 5 million French word associations so far.
  • SWOW (Small World of Words): lists word association and participant data for 100 primary, secondary and tertiary responses to 12,292 cues, collected from over 90,000 participants.
  • Birkbeck norms: contain 40 to 50 responses for over 2,600 cues in British English.

Word Embeddings

Word embeddings are vector representations of words, i.e. an array of floating-point numbers for each word. This helps computers make more sense out of the mystical natural language we humans use, and has been fairly helpful in recent developments in Natural Language Processing (think Machine Translation), Information Retrieval (think Search Engines), and Image Captioning (think Google Images or GIF search). In layman's terms, the secret lies in letting related words have similar vectors, but there has been a slew of approaches to come up with such word embeddings. We list some of the most popular, as well as the most effective ones so far. We have used the pretrained versions of these in our experiments, and you will find them in the WordVectors folder:

You will also find a Base Random embedding in the WordVectors folder, a baseline constructed by randomly assigning 300 floating-point numbers to each word in the common vocabulary of the above five embeddings.
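As a rough illustration, such a random baseline can be built in a few lines. This is only a sketch: the vocabulary, seed, and output file name below are placeholders, not the exact files shipped in WordVectors.

import numpy as np

# Hypothetical common vocabulary: in the repository this would be the
# intersection of the vocabularies of the pretrained embeddings compared.
common_vocab = ["king", "queen", "rule", "kingdom"]  # placeholder list

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Write the baseline in the word2vec/FastText text format: a header line,
# then one word per line followed by its 300 random floating-point values.
with open("base_random_300d.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(common_vocab)} 300\n")
    for word in common_vocab:
        vec = rng.standard_normal(300)
        f.write(word + " " + " ".join(f"{x:.5f}" for x in vec) + "\n")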

Intrinsic and Extrinsic Evaluation

It is of natural interest to the NLP community to identify evaluation metrics for word embeddings. Besides direct performance measurement on downstream tasks (also called Extrinsic Evaluation) like Sentiment Classification, Question Answering, and Chunking, several Intrinsic Evaluation measures have also been proposed, such as WordSim-353 and SimLex-999. While Extrinsic Evaluations use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task, Intrinsic Evaluations directly test for syntactic or semantic relationships between words (Schnabel et al., 2015). Another way to tell apart Intrinsic from Extrinsic evaluations is the lack of any trainable parameters in the former.
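For concreteness, a typical word-similarity style intrinsic evaluation looks roughly like the following sketch. All vectors and human scores here are invented for illustration, and scipy is assumed to be available; the point is simply that the embedding is scored directly, with no trained parameters.

import numpy as np
from scipy.stats import spearmanr

# Toy embedding table and human similarity judgments (all values invented).
emb = {
    "king":  np.array([0.9, 0.1, 0.3]),
    "queen": np.array([0.8, 0.2, 0.4]),
    "apple": np.array([0.1, 0.9, 0.2]),
}
human_pairs = [("king", "queen", 8.5), ("queen", "apple", 2.0), ("king", "apple", 1.2)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in human_pairs]
gold_scores = [score for _, _, score in human_pairs]

# No parameters are trained: the embedding is scored by how well its cosine
# similarities correlate with human judgments.
rho, _ = spearmanr(model_scores, gold_scores)
print(f"Spearman correlation: {rho:.3f}")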

Intrinsic tasks are useful as long as they can accurately predict a model's performance on Extrinsic evaluations, since at the end of the day, the ability to solve downstream tasks is all that matters. Here are a few resources and existing projects that aim to bridge the gap between the two, by experimenting thoroughly with multiple tasks and multiple embeddings:

  • VecEval: a repository to run several Extrinsic Evaluation tasks (slightly outdated), by Nayak et al. 2016.
  • wordvectors.org: a repository to run several Intrinsic Evaluation tasks, by Faruqui and Dyer 2014.
  • ACL SOTA: maintains benchmark pages for word similarity.
  • Vecto AI: an exhaustive collection of Intrinsic tasks, beyond word similarity and relatedness. Vecto is also a library to help run experiments in distributional semantics.
  • WE Benchmarks: another collection of benchmarks for intrinsic evaluation of pretrained embeddings.

How to Run

There are two fairly interactive and easy-to-read scripts (Python 3) in this repository:

  • SWOW subset.ipynb: A Jupyter Notebook that takes you from the original Small World of Words dataset to a SWOW-NNNN subset in cue: response format. You can set different conditions, such as the minimum count (frequency) a cue-response pair must have to be accepted into your custom word association dataset. The result is a simple pickle file, used by the following script; a rough sketch of this filtering step appears after this list.
  • SWOW_eval.py: Reads one or more pretrained embeddings (in FastText .txt format) and an evaluation file (the output of the above Jupyter Notebook). It then reports Precision, Recall, Accuracy, Confidence Interval, and OOV (out-of-vocabulary) word counts for the word association task.
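The exact filtering logic lives in the notebook; the core idea (count cue-response pairs and keep those above a minimum frequency) can be sketched as follows. The input file name and column names are assumptions and may not match the real SWOW CSV.

import pickle
from collections import Counter, defaultdict

import pandas as pd

# Assumed file and column names for the raw SWOW responses; the real CSV from
# smallworldofwords.org may use different ones.
df = pd.read_csv("SWOW-EN.R100.csv")
pair_counts = Counter(zip(df["cue"], df["R1"]))  # count primary responses per cue

MIN_COUNT = 20  # e.g. the "min20" in Apr_3_cr_dict_min20.pkl
cue_to_responses = defaultdict(list)
for (cue, response), count in pair_counts.items():
    if count >= MIN_COUNT:
        cue_to_responses[cue].append(response)

# SWOW_eval.py consumes a cue -> responses dictionary pickled like this.
with open("cr_dict_min20.pkl", "wb") as f:
    pickle.dump(dict(cue_to_responses), f)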

To evaluate a single word embedding file:

python SWOW_eval.py <debug> <evalFile> <vecFile>
python SWOW_eval.py 1 Apr_3_cr_dict_min20.pkl numberbatch_65876.txt

To evaluate multiple embedding files (saved within a specific folder):

python SWOW_eval.py <debug> <evalFile> _ <vecFolder>
python SWOW_eval.py 0 Apr_3_cr_dict_min20.pkl _ wordVectors/
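Under the hood, the evaluation boils down to checking whether a cue's gold responses appear among its nearest neighbours in embedding space. The following is a minimal sketch of that idea only, not the exact SWOW_eval.py implementation; the value of K and the single reported metric are assumptions.

import pickle

import numpy as np

def load_vectors(path):
    # Load a FastText-style .txt embedding file (word followed by floats).
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= 2:          # skip an optional "count dim" header line
                continue
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

vecs = load_vectors("numberbatch_65876.txt")       # an embedding file from WordVectors
with open("Apr_3_cr_dict_min20.pkl", "rb") as f:   # output of SWOW subset.ipynb
    cue_to_responses = pickle.load(f)

words = list(vecs)
matrix = np.stack([vecs[w] for w in words])
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

K, hits, total, oov = 10, 0, 0, 0                  # K neighbours is an assumption
for cue, responses in cue_to_responses.items():
    if cue not in vecs:
        oov += 1
        continue
    query = vecs[cue] / np.linalg.norm(vecs[cue])
    nearest = [words[i] for i in np.argsort(-(matrix @ query))[1:K + 1]]  # drop the cue itself
    hits += sum(r in nearest for r in responses)
    total += len(responses)

print(f"Recall@{K}: {hits / total:.3f}   OOV cues: {oov}")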

Results

We found that the new word association task not only saves much of the expensive and time-consuming human annotation process, but also yields much better confidence intervals on intrinsic evaluation scores. Performance on SWOW-8500 is shown to correlate both with (1) existing Word Similarity/Relatedness tasks (Intrinsic Evaluation) and with (2) multiple downstream tasks (Extrinsic Evaluation). For further details, please refer to our paper, linked in the Citation section above.

Further Reading

License

The code is licensed under MIT; however, the embeddings distributed within this package might be under different licenses. If you are unsure, please reach out to the authors (references are included in the docstrings).
