SWOW-eval

This project describes a new task for evaluating pre-trained word embeddings. In particular, we present an intrinsic evaluation task, SWOW-8500, which employs a large word association dataset called the Small World of Words (SWOW).

This repository also serves as the go-to page for our paper titled SWOW-8500: Word Association Task for Intrinsic Evaluation of Word Embeddings, accepted at the RepEval 2019 workshop, co-located with the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019), held June 2–7, 2019, in Minneapolis, United States.

Contributors

Avijit Thawani: BTech and MTech in Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi; PhD student, Department of Computer Science, Viterbi School of Engineering, University of Southern California.
Contact: avijit.thawani.cse14@iitbhu.ac.in

Biplav Srivastava: Distinguished Data Scientist and Master Inventor, IBM, New York.

Anil Kumar Singh: Associate Professor, Department of Computer Science and Engineering, Indian Institute of Technology (BHU) Varanasi.

Index

  1. Word Associations
  2. Word Embeddings
  3. Intrinsic and Extrinsic Evaluation
  4. How to Run
  5. Results
  6. Further Reading
  7. License

Citation

Please consider citing us if you found our project useful:

@inproceedings{thawani-etal-2019-swow,
    title = "{SWOW}-8500: Word Association task for Intrinsic Evaluation of Word Embeddings",
    author = "Thawani, Avijit  and
      Srivastava, Biplav  and
      Singh, Anil",
    editor = "Rogers, Anna  and
      Drozd, Aleksandr  and
      Rumshisky, Anna  and
      Goldberg, Yoav",
    booktitle = "Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for {NLP}",
    month = jun,
    year = "2019",
    address = "Minneapolis, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-2006",
    doi = "10.18653/v1/W19-2006",
    pages = "43--51",
    abstract = "Downstream evaluation of pretrained word embeddings is expensive, more so for tasks where current state of the art models are very large architectures. Intrinsic evaluation using word similarity or analogy datasets, on the other hand, suffers from several disadvantages. We propose a novel intrinsic evaluation task employing large word association datasets (particularly the Small World of Words dataset). We observe correlations not just between performances on SWOW-8500 and previously proposed intrinsic tasks of word similarity prediction, but also with downstream tasks (eg. Text Classification and Natural Language Inference). Most importantly, we report better confidence intervals for scores on our word association task, with no fall in correlation with downstream performance.",
}

Word Associations

Word association games are those in which a participant is asked to utter the first word (or first few words) that comes to mind when given a trigger/cue/stimulus word. For example, given the cue KING, one could respond with RULE, QUEEN, KINGDOM, or even KONG (from the film King Kong). Word associations have long intrigued psychologists, including Carl Jung, and several large studies have been conducted to collect them. Some prominent datasets that collect participant responses to word association games are listed below:

  • USF-FA: The University of South Florida Free Association norms contain single-word association responses from an average of 149 participants per cue for a set of 5,019 cue words.
  • EAT: The Edinburgh Associative Thesaurus collects 100 responses per cue for a total of 8,400 cues.
  • JeuxDeMots: a crowdsourced game that has collected over 5 million French word associations so far.
  • SWOW (Small World of Words): lists word association and participant data for 100 primary, secondary and tertiary responses to 12,292 cues, collected from over 90,000 participants.
  • Birkbeck norms: contain 40 to 50 responses for over 2,600 cues in British English.

Word Embeddings

Word embeddings are vector representations of words, i.e. an array of floating-point numbers for each word. This helps computers make more sense out of the mystical natural language we humans use, and has been fairly helpful in recent developments in Natural Language Processing (think Machine Translation), Information Retrieval (think Search Engines), and Image Captioning (think Google Images or GIF search). In layman's terms, the secret lies in letting related words have similar vectors, but there has been a slew of approaches to come up with such word embeddings. We list some of the most popular, as well as the most effective ones so far. We have used the pretrained versions of these in our experiments, and you will find them in the WordVectors folder:

You will also find a Base Random embedding in the WordVectors folder, a baseline constructed by randomly assigning 300 floating-point numbers to each word in the common vocabulary of the above five embeddings.
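As a rough illustration, such a random baseline can be built in a few lines. This is only a sketch: the vocabulary, seed, and output file name below are placeholders, not the exact files shipped in WordVectors.

import numpy as np

# Hypothetical common vocabulary: in the repository this would be the
# intersection of the vocabularies of the pretrained embeddings compared.
common_vocab = ["king", "queen", "rule", "kingdom"]  # placeholder list

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility

# Write the baseline in the word2vec/FastText text format: a header line,
# then one word per line followed by its 300 random floating-point values.
with open("base_random_300d.txt", "w", encoding="utf-8") as f:
    f.write(f"{len(common_vocab)} 300\n")
    for word in common_vocab:
        vec = rng.standard_normal(300)
        f.write(word + " " + " ".join(f"{x:.5f}" for x in vec) + "\n")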

Intrinsic and Extrinsic Evaluation

It is of natural interest to the NLP community to identify evaluation metrics for word embeddings. Besides direct performance measurement on downstream tasks (also called Extrinsic Evaluation) like Sentiment Classification, Question Answering, and Chunking, several Intrinsic Evaluation measures have also been proposed, such as WordSim-353 and SimLex-999. While Extrinsic Evaluations use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task, Intrinsic Evaluations directly test for syntactic or semantic relationships between words (Schnabel et al., 2015). Another way to tell apart Intrinsic from Extrinsic evaluations is the lack of any trainable parameters in the former.
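For concreteness, a typical word-similarity style intrinsic evaluation looks roughly like the following sketch. All vectors and human scores here are invented for illustration, and scipy is assumed to be available; the point is simply that the embedding is scored directly, with no trained parameters.

import numpy as np
from scipy.stats import spearmanr

# Toy embedding table and human similarity judgments (all values invented).
emb = {
    "king":  np.array([0.9, 0.1, 0.3]),
    "queen": np.array([0.8, 0.2, 0.4]),
    "apple": np.array([0.1, 0.9, 0.2]),
}
human_pairs = [("king", "queen", 8.5), ("queen", "apple", 2.0), ("king", "apple", 1.2)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in human_pairs]
gold_scores = [score for _, _, score in human_pairs]

# No parameters are trained: the embedding is scored by how well its cosine
# similarities correlate with human judgments.
rho, _ = spearmanr(model_scores, gold_scores)
print(f"Spearman correlation: {rho:.3f}")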

Intrinsic tasks are useful as long as they can accurately predict a model's performance on Extrinsic evaluations, since at the end of the day, the ability to solve downstream tasks is all that matters. Here are a few resources and existing projects that aim to bridge the gap between the two, by experimenting thoroughly with multiple tasks and multiple embeddings:

  • VecEval: a repository to run several Extrinsic Evaluation tasks (slightly outdated), by Nayak et al. 2016.
  • wordvectors.org: a repository to run several Intrinsic Evaluation tasks, by Faruqui and Dyer 2014.
  • ACL SOTA: maintains benchmark pages for word similarity.
  • Vecto AI: an exhaustive collection of Intrinsic tasks, beyond word similarity and relatedness. Vecto is also a library to help run experiments in distributional semantics.
  • WE Benchmarks: another collection of benchmarks for intrinsic evaluation of pretrained embeddings.

How to Run

There are two fairly interactive and easy-to-read scripts (Python 3) in this repository:

  • SWOW subset.ipynb: A Jupyter Notebook that takes you from the original Small World of Words dataset to a SWOW-NNNN subset in cue: response format. You can set different conditions, such as the minimum count (frequency) a cue-response pair must have to be accepted into your custom word association dataset. The result is a simple pickle file, used by the following script; a rough sketch of this filtering step appears after this list.
  • SWOW_eval.py: Reads one or more pretrained embeddings (in FastText .txt format) and an evaluation file (the output of the above Jupyter Notebook). It then reports Precision, Recall, Accuracy, Confidence Interval, and OOV (out-of-vocabulary) word counts for the word association task.
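The exact filtering logic lives in the notebook; the core idea (count cue-response pairs and keep those above a minimum frequency) can be sketched as follows. The input file name and column names are assumptions and may not match the real SWOW CSV.

import pickle
from collections import Counter, defaultdict

import pandas as pd

# Assumed file and column names for the raw SWOW responses; the real CSV from
# smallworldofwords.org may use different ones.
df = pd.read_csv("SWOW-EN.R100.csv")
pair_counts = Counter(zip(df["cue"], df["R1"]))  # count primary responses per cue

MIN_COUNT = 20  # e.g. the "min20" in Apr_3_cr_dict_min20.pkl
cue_to_responses = defaultdict(list)
for (cue, response), count in pair_counts.items():
    if count >= MIN_COUNT:
        cue_to_responses[cue].append(response)

# SWOW_eval.py consumes a cue -> responses dictionary pickled like this.
with open("cr_dict_min20.pkl", "wb") as f:
    pickle.dump(dict(cue_to_responses), f)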

To evaluate a single word embedding file:

python SWOW_eval.py <debug> <evalFile> <vecFile>
python SWOW_eval.py 1 Apr_3_cr_dict_min20.pkl numberbatch_65876.txt

To evaluate multiple embedding files (saved within a specific folder):

python SWOW_eval.py <debug> <evalFile> _ <vecFolder>
python SWOW_eval.py 0 Apr_3_cr_dict_min20.pkl _ wordVectors/
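Under the hood, the evaluation boils down to checking whether a cue's gold responses appear among its nearest neighbours in embedding space. The following is a minimal sketch of that idea only, not the exact SWOW_eval.py implementation; the value of K and the single reported metric are assumptions.

import pickle

import numpy as np

def load_vectors(path):
    # Load a FastText-style .txt embedding file (word followed by floats).
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= 2:          # skip an optional "count dim" header line
                continue
            vecs[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vecs

vecs = load_vectors("numberbatch_65876.txt")       # an embedding file from WordVectors
with open("Apr_3_cr_dict_min20.pkl", "rb") as f:   # output of SWOW subset.ipynb
    cue_to_responses = pickle.load(f)

words = list(vecs)
matrix = np.stack([vecs[w] for w in words])
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

K, hits, total, oov = 10, 0, 0, 0                  # K neighbours is an assumption
for cue, responses in cue_to_responses.items():
    if cue not in vecs:
        oov += 1
        continue
    query = vecs[cue] / np.linalg.norm(vecs[cue])
    nearest = [words[i] for i in np.argsort(-(matrix @ query))[1:K + 1]]  # drop the cue itself
    hits += sum(r in nearest for r in responses)
    total += len(responses)

print(f"Recall@{K}: {hits / total:.3f}   OOV cues: {oov}")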

Results

We found that the new word association task not only saves much of the expensive and time-consuming human annotation process, but also yields much better confidence intervals on intrinsic evaluation scores. Performance on SWOW-8500 is shown to correlate both with (1) existing Word Similarity/Relatedness tasks (Intrinsic Evaluation) and with (2) multiple downstream tasks (Extrinsic Evaluation). For further details, please refer to our paper, linked in the Citation section above.

Further Reading

License

The code is licensed under MIT; however, the embeddings distributed within this package might be under different licenses. If you are unsure, please reach out to the authors (references are included in the docstrings).
