yuvalpinter/unblend

Will it Unblend?

This is the home for code and data from the paper Will it Unblend?, Findings of EMNLP, November 2020.

Contents

November 13, 2021: We released the complex words dataset of 312 novel blends and compounds. The data uses the following schema:

  • class: whether the word is a blend or a compound.
  • word: a word first appearing in the New York Times between November 2017 and March 2019 (taken from NYTWIT, follow link for details).
  • bases: the words contributing to the complex word (space-delimited), manually annotated with the help of the originating NYT context.
  • sequence: character-level annotation of the word reflecting each character's origin: Prefix; A/B/C, one of the bases (labeled successively according to their order in the bases column); X, more than one base; O, additional material; Suffix. See section 2 of the paper for details.
  • linearity: whether the relation between the base-contributing parts of the word is linear. A word is linear if it contains no O, no A preceded by a B or X, and no B followed by an A or X, with the natural extension to words with a C. Compounds, by definition, contain no X or O and are always linear.
  • semantic relation: the relationship between the bases, annotated according to the schema from Tratz and Hovy, 2010.
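As a rough illustration of how the sequence and linearity columns relate, here is a minimal Python sketch. It is not the code used for the paper. It assumes that the sequence column holds exactly one label per character of the word, that the two-base linearity rule above applies to labels anywhere earlier or later in the string (not only adjacent ones), and that any labels other than A, B, X and O can be ignored for this check. The function names `align` and `is_linear` are hypothetical, and the example word is made up for illustration rather than taken from the dataset. Words with a third base (C) are not handled.

```python
from typing import List, Tuple


def align(word: str, sequence: str) -> List[Tuple[str, str]]:
    """Pair each character of the word with its origin label.

    Assumption: the sequence has exactly one label per character.
    """
    if len(word) != len(sequence):
        raise ValueError("word and sequence must have the same length")
    return list(zip(word, sequence))


def is_linear(sequence: str) -> bool:
    """Check the two-base linearity rule described in the schema.

    Assumption: "preceded" and "followed" mean anywhere earlier or
    later in the sequence. Three-base words (label C) are out of scope.
    """
    if "C" in sequence:
        raise ValueError("three-base words are not handled in this sketch")
    if "O" in sequence:
        return False  # additional material breaks linearity
    seen_b = False
    seen_x = False
    for label in sequence:
        if label == "A" and (seen_b or seen_x):
            return False  # an A preceded by a B or X
        if label == "X" and seen_b:
            return False  # a B followed by an X
        if label == "B":
            seen_b = True
        elif label == "X":
            seen_x = True
    return True


# Made-up example (not from the dataset): "brunch" from "breakfast lunch".
pairs = align("brunch", "AABBBB")
linear = is_linear("AABBBB")
```

Under these assumptions, a sequence such as `AAXBB` (shared material between the two bases) counts as linear, while `ABAB` or any sequence containing `O` does not.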

Stay tuned for the following releases:

  • Code and data for reproducing the similarity experiments in section 3, including all BERT activations and lists of smoothies. (February 16, 2021)
  • Code and data for reproducing the segmentation experiments in section 4.1, including models for the character LM, the character tagger and the news-trained BPE table.
  • Code and data for reproducing the recovery experiments in section 4.2, including candidate lists. (December 2, 2021)

Citing is Caring

Please use the following citation when you use our data or methods:

@inproceedings{pinter-etal-2020-will,
    title = "Will it Unblend?",
    author = "Pinter, Yuval  and
      Jacobs, Cassandra L.  and
      Eisenstein, Jacob",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.138",
    pages = "1525--1535",
}
