
Datasets for the Monolingual Word Sense Alignment (MWSA) task


elexis-eu/MWSA


Multilingual Evaluation Datasets for Monolingual Word Sense Alignment


Monolingual Word Sense Alignment (MWSA) is the task of aligning word senses across resources in the same language. Because a word can be defined differently in different resources, the task consists of finding which senses in one resource are semantically related to which senses in another. This task was the focus of the 1st "Monolingual Word Sense Alignment" Shared Task.

The current repository contains a set of 17 datasets of manually annotated senses developed within the ELEXIS project. These datasets cover 15 languages and are based on expert-made dictionaries along with collaboratively curated ones, such as Wiktionary. The following table gives the statistics of the datasets: for each resource and part of speech, the number of senses, with the number of words in the definitions in parentheses.

Data description

| Language | Resource | Nouns | Verbs | Adjectives | Adverbs | Other | All |
|---|---|---|---|---|---|---|---|
| Basque (eu) | Basque Wordnet | 929 (6836) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 929 (6836) |
| | Euskal Hiztegia | 971 (7754) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 971 (7754) |
| Bulgarian (bg) | BTB-WN | 1394 (15649) | 175 (1698) | 305 (3187) | 50 (338) | 0 (0) | 1924 (20872) |
| | Bulgarian Wiktionary | 1273 (12883) | 164 (1107) | 194 (1418) | 39 (306) | 0 (0) | 1670 (15714) |
| Danish (da) | Ordbog over det danske Sprog | 2176 (282040) | 983 (119163) | 436 (60599) | 0 (0) | 0 (0) | 3595 (461802) |
| | Den Danske Ordbog | 1036 (12326) | 383 (4045) | 248 (2228) | 0 (0) | 0 (0) | 1667 (18599) |
| Dutch (nl) | Woordenboek der Nederlandsche Taal | 1459 (28979) | 405 (5185) | 527 (7878) | 106 (2662) | 0 (0) | 2497 (44704) |
| | Algemeen Nederlands Woordenboek | 497 (8443) | 140 (1542) | 109 (1393) | 13 (172) | 0 (0) | 759 (11550) |
| English (KD) (en) | Global | 92 (532) | 107 (617) | 80 (457) | 57 (257) | 61 (283) | 397 (2146) |
| | Password | 66 (536) | 72 (417) | 62 (324) | 33 (177) | 46 (188) | 279 (1642) |
| English (NUIG) (en) | Webster 1913 | 1131 (11606) | 741 (4622) | 373 (2585) | 45 (269) | 0 (0) | 2290 (19082) |
| | Princeton WordNet | 730 (12166) | 496 (6980) | 249 (2892) | 24 (207) | 0 (0) | 1499 (22245) |
| Estonian (et) | Dictionary of Estonian (EKS) | 543 (4012) | 273 (1598) | 151 (747) | 98 (451) | 78 (370) | 1143 (7178) |
| | Estonian Basic Dictionary (PSV) | 543 (4492) | 273 (1983) | 151 (1097) | 98 (596) | 79 (468) | 1144 (8636) |
| German (de) | German Wiktionary | 2026 (15160) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2026 (15160) |
| | German OmegaWiki | 1266 (14354) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1266 (14354) |
| Hungarian (hu) | Comprehensive | X | X | X | X | X | 1355 (14654) |
| | Explanatory | X | X | X | X | X | 1038 (10934) |
| Irish (ga) | An Foclóir Beag | 891 (8053) | 11 (95) | 55 (267) | 10 (56) | 36 (171) | 1003 (8642) |
| | Irish Wiktionary | 1209 (6696) | 8 (45) | 61 (181) | 10 (41) | 36 (109) | 1324 (7072) |
| Italian (it) | ItalWordNet | 408 (3128) | 352 (2411) | 0 (0) | 0 (0) | 0 (0) | 760 (5539) |
| | SIMPLE | 290 (1990) | 218 (1240) | 0 (0) | 0 (0) | 0 (0) | 508 (3230) |
| Serbian (sr) | Serbian WordNet | 691 (5864) | 985 (6522) | 92 (713) | 0 (0) | 0 (0) | 1768 (13099) |
| | Dictionary of Serbo-Croatian Literary Language | 289 (2360) | 281 (1527) | 29 (215) | 0 (0) | 0 (0) | 599 (4102) |
| Slovenian (JSI) (sl) | Slovene WordNet | 409 (1106) | 303 (901) | 237 (733) | 44 (133) | 0 (0) | 993 (2873) |
| | Slovene Lexical Database | 284 (2237) | 191 (1047) | 220 (1486) | 29 (102) | 0 (0) | 724 (4872) |
| Slovenian (ISJFR) (sl) | Standard Slovenian Dictionary (eSSKJ) | 229 (2060) | 109 (911) | 76 (620) | 0 (0) | 60 (588) | 474 (4179) |
| | Kostelski slovar | 151 (1050) | 61 (308) | 45 (257) | 0 (0) | 38 (263) | 295 (1878) |
| Spanish (es) | Diccionario de la lengua española | 617 (7986) | 225 (2426) | 305 (3269) | 26 (161) | 24 (250) | 1197 (14092) |
| | Spanish Wiktionary | 602 (6421) | 227 (2045) | 294 (2825) | 25 (129) | 22 (123) | 1170 (11543) |
| Portuguese (pt-pt) | Dicionário da Língua Portuguesa Contemporânea | 285 (4060) | 58 (686) | 110 (1287) | 9 (143) | 1 (9) | 463 (6185) |
| | Dicionário Aberto | 199 (1521) | 53 (203) | 67 (372) | 3 (15) | 1 (5) | 323 (2116) |
| Russian (ru) | Ozhegov-Shvedova | 258 (2038) | 109 (615) | 101 (533) | 15 (77) | 44 (368) | 527 (3631) |
| | Dictionary of the Russian Language (MAS) | 310 (2811) | 173 (1338) | 190 (1219) | 20 (114) | 71 (1010) | 764 (6492) |

This repository contains the datasets in JSON, RDF and TSV. In the TSV files, each line corresponds to a sense pair, and the last column gives the type of semantic relationship between the two senses. We have also included the relationships induced in the opposite direction, that is, with the two senses swapped, as in the following example:

especial	adjective		que se aplica exclusivamente a alguém ou a alguma coisa. ≈ exclusivo, particular, privado.	exclusivo.	narrower
especial	adjective		exclusivo.	que se aplica exclusivamente a alguém ou a alguma coisa. ≈ exclusivo, particular, privado.	broader

Here the first row states a narrower relation, and the second row states the corresponding broader relation with the two senses swapped.
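The TSV layout can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the `SensePair` type, the `parse_pairs` and `invert` helpers, and the column positions (lemma, part of speech, then the two definitions and the relation as the last three columns) are assumptions inferred from the two example rows above. Only the `narrower` and `broader` labels are taken from the example; leaving any other label unchanged when the senses are swapped is also an assumption.

```python
import csv
from typing import Iterable, List, NamedTuple


class SensePair(NamedTuple):
    """One TSV row: two definitions of a lemma and their relation."""
    lemma: str
    pos: str
    sense_1: str
    sense_2: str
    relation: str


# Labels seen in the example rows. Any other label is left unchanged
# by invert(), which is an assumption of this sketch.
INVERSE = {"narrower": "broader", "broader": "narrower"}


def parse_pairs(lines: Iterable[str]) -> List[SensePair]:
    """Parse tab-separated lines, assuming the last three columns are
    the first definition, the second definition and the relation."""
    pairs = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:  # skip blank or malformed lines
            continue
        pairs.append(SensePair(row[0], row[1], row[-3], row[-2], row[-1]))
    return pairs


def invert(pair: SensePair) -> SensePair:
    """Swap the two senses and flip the direction of the relation."""
    return pair._replace(
        sense_1=pair.sense_2,
        sense_2=pair.sense_1,
        relation=INVERSE.get(pair.relation, pair.relation),
    )


# The two example rows, rebuilt with explicit tab characters.
LONG = ("que se aplica exclusivamente a alguém ou a alguma coisa. "
        "≈ exclusivo, particular, privado.")
SHORT = "exclusivo."
SAMPLE = [
    "\t".join(["especial", "adjective", "", LONG, SHORT, "narrower"]),
    "\t".join(["especial", "adjective", "", SHORT, LONG, "broader"]),
]

first, second = parse_pairs(SAMPLE)
print(invert(first) == second)  # True: the second row is the first one inverted
```

Reading with `csv.QUOTE_NONE` keeps quotation marks inside definitions as literal text instead of treating them as field delimiters, which seems the safer choice for dictionary glosses.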

Conversion to RDF

json-to-rdf.py is a simple script that converts the JSON alignments into TSV and then RDF. This allows you to use the datasets with NAISC.

Reference

If you're using any part of these datasets, please don't forget to cite the following paper:

@inproceedings{ahmadi2020multilingual,
	title={A Multilingual Evaluation Dataset for Monolingual Word Sense Alignment},
	author="Ahmadi, Sina and McCrae, John P. and Nimb, Sanni and Khan, Fahad and Monachini, Monica and Pedersen, Bolette S. and Declerck, Thierry and Wissik, Tanja and Bellandi, Andrea and Pisani, Irene and Troelsgård, Thomas and Olsen, Sussi and Krek, Simon and Lipp, Veronika and Váradi, Tamás and Simon, László and Győrffy, András and Tiberius, Carole and Schoonheim, Tanneke and Ben Moshe, Yifat and Rudich, Maya and Abu Ahmad, Raya and Lonke, Dorielle and Kovalenko, Kira and Langemets, Margit and Kallas, Jelena and Dereza, Oksana and Fransen, Theodorus and Cillessen, David and Lindemann, David and Alonso, Mikel and Salgado, Ana and Sancho, José Luis and Ureña-Ruiz, Rafael-J. and Simov, Kiril and Osenova, Petya and Kancheva, Zara and Radev, Ivaylo and Stanković, Ranka and Perdih, Andrej and Gabrovšek, Dejan",
	booktitle="Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020)",
	year={2020},
	date="2020-05-11",
	address= "Marseille, France"
}

Licence

This repository is licensed under the Apache License 2.0.
