artreven/thesaural_wsi

Thesaural Word Sense Induction

Data for discrimination of word senses using hypernyms.

The repository contains two datasets: cocktails and mesh.

Thesaurus

The thesaurus "All about cocktails" is provided in the file cocktail.ttl in Turtle format.

The MeSH thesaurus is not contained in this repository. It can be found online at https://id.nlm.nih.gov/mesh/.

Corpora

The corpora are extracted using Wikilinks. Every folder contains a corpus related to one ambiguous cocktail name. Every folder also contains a file forms.tsv listing all the surface forms of that cocktail name. The individual texts are stored in separate files, named according to the following convention:

{numeric id}__{category name}.txt

The category name is the English Wikipedia id of the category, i.e. the Wikipedia page of the category is at the URL https://en.wikipedia.org/wiki/{category name}.
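
The layout described above can be read with a few lines of Python. The following is only a minimal sketch, not code shipped with this repository. The function names (`parse_text_filename`, `category_url`, `read_surface_forms`, `load_corpus`) are hypothetical. It assumes the files are UTF-8 encoded, and that forms.tsv holds one surface form per line in its first tab-separated column; the actual column layout of forms.tsv is not documented here, so adjust as needed.

```python
from pathlib import Path
from urllib.parse import quote

WIKIPEDIA_BASE = "https://en.wikipedia.org/wiki/"


def parse_text_filename(filename):
    """Split '{numeric id}__{category name}.txt' into (numeric id, category name)."""
    stem = Path(filename).stem  # drop the '.txt' extension
    # Split on the first '__' only, so underscores inside the category name survive.
    numeric_id, separator, category = stem.partition("__")
    if not separator or not numeric_id.isdigit() or not category:
        raise ValueError(f"unexpected file name: {filename!r}")
    return int(numeric_id), category


def category_url(category):
    """Build the English Wikipedia URL for a category name."""
    return WIKIPEDIA_BASE + quote(category)


def read_surface_forms(folder):
    """Read surface forms from forms.tsv.

    Assumption: one form per line, taken from the first tab-separated column.
    """
    forms = []
    with open(Path(folder) / "forms.tsv", encoding="utf-8") as handle:
        for line in handle:
            first_column = line.rstrip("\n").split("\t")[0].strip()
            if first_column:
                forms.append(first_column)
    return forms


def load_corpus(folder):
    """Return (surface forms, list of text records) for one corpus folder."""
    texts = []
    for path in sorted(Path(folder).glob("*.txt")):
        numeric_id, category = parse_text_filename(path.name)
        texts.append(
            {
                "id": numeric_id,
                "category": category,
                "url": category_url(category),
                "text": path.read_text(encoding="utf-8"),
            }
        )
    return read_surface_forms(folder), texts
```

Calling `load_corpus` on one of the corpus folders would return the surface forms together with one record per text, where the `category` field is the sense label taken from the file name.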