ltgoslo/wugs_with_definitions

Word Usage Graphs enriched with cluster definitions

This is a dataset of word usage graphs (WUGs) in which existing WUGs for multiple languages are enriched with cluster labels that function as sense definitions. The labels are generated from scratch by fine-tuned encoder-decoder language models. The resulting enriched datasets can be helpful for explainable semantic change modeling.

Contents

  • code/: the scripts we used to prepare the datasets
  • human_evaluation/: everything related to our human evaluation efforts
  • wug_labels/: the cluster labels themselves (the main part of the dataset)

We provide cluster labels (sense definitions) for the following WUGs:

Format

Every WUG dataset in the wug_labels/ directory contains one subdirectory per target word, following the original DWUG format. Within each target word directory, we provide one file named cluster_gloss.tsv. It is a tab-separated table with two columns:

  • cluster: the numerical identifier of the cluster from the original WUG
  • gloss: the definition generated for this cluster
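
As a rough illustration of the layout described above, the following Python sketch collects the glosses for every target word in one dataset. It assumes that each cluster_gloss.tsv starts with a header row naming the two columns (cluster and gloss); the function names are hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def load_cluster_glosses(tsv_path):
    """Read one cluster_gloss.tsv file into a {cluster_id: gloss} dict.

    Assumption: the file has a header row with the columns
    'cluster' and 'gloss'. Cluster identifiers are kept as strings.
    """
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["cluster"]: row["gloss"] for row in reader}


def load_all_glosses(dataset_dir):
    """Map each target word (the parent directory name) to its glosses."""
    return {
        tsv.parent.name: load_cluster_glosses(tsv)
        for tsv in sorted(Path(dataset_dir).rglob("cluster_gloss.tsv"))
    }
```

Calling load_all_glosses on one dataset directory inside wug_labels/ returns a nested dictionary keyed first by target word and then by cluster identifier. Searching recursively with rglob avoids hard-coding the exact nesting depth, which may differ between datasets.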

The cluster labels should be used together with the original word usage graphs for the corresponding languages. As a rule, the cluster assigned to each WUG usage (sentence) can be found in the clusters/ directory of the original WUG.

NB: some clusters are too small (fewer than 3 usages) to generate a meaningful definition. In these cases, the gloss is the placeholder "Too few examples to generate a proper definition!".
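
A minimal sketch of how the glosses might be attached to individual usages while skipping the placeholder: it assumes the usage-to-cluster assignments have already been read from the original WUG into a dictionary, and that the glosses are available as a mapping from cluster identifier (as a string) to gloss. The function name and the example identifiers are hypothetical.

```python
# Placeholder gloss used for clusters that are too small.
PLACEHOLDER = "Too few examples to generate a proper definition!"


def gloss_usages(usage_clusters, glosses):
    """Attach a gloss to every usage identifier.

    usage_clusters: {usage_id: cluster_id}, taken from the original WUG.
    glosses: {cluster_id (str): gloss}, taken from cluster_gloss.tsv.
    Usages whose cluster has no gloss, or only the placeholder, map to None.
    """
    labelled = {}
    for usage_id, cluster in usage_clusters.items():
        gloss = glosses.get(str(cluster))
        labelled[usage_id] = None if gloss == PLACEHOLDER else gloss
    return labelled


# Hypothetical example data, for illustration only.
example = gloss_usages(
    {"usage_a": 0, "usage_b": 1, "usage_c": 2},
    {"0": "an aircraft", "1": PLACEHOLDER},
)
print(example)
# {'usage_a': 'an aircraft', 'usage_b': None, 'usage_c': None}
```

Returning None instead of the placeholder string makes it straightforward to exclude undefined clusters from downstream analysis.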

Citation

See details in the paper "Enriching Word Usage Graphs with Cluster Definitions" (LREC-COLING'2024) by Mariia Fedorova, Andrey Kutuzov, Nikolay Arefyev and Dominik Schlechtweg.

Definition generation models:

About

Methods that annotate word occurrences with glosses describing their meaning.
