sinaahmadi/KurdishTokenization

Kurdish Tokenization

A Tokenization System for the Kurdish Language (Sorani & Kurmanji dialects)

This repository contains the data for the tokenization system described in the paper "A Tokenization System for the Kurdish Language". The paper proposes an approach to tokenizing the Sorani and Kurmanji dialects of Kurdish using a lexicon and a morphological analyzer. The tokenizer is available as a module in the Kurdish Language Processing Toolkit (KLPT).

Gold-standard Datasets

In addition to the tokenization tool, we provide a gold-standard dataset in the data folder containing 100 Sorani and Kurmanji sentences in the Text Corpus Format. These sentences are manually tokenized and therefore can be used for evaluation purposes.
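As a rough illustration of how manually tokenized sentences can serve for evaluation, the sketch below scores a predicted tokenization against a gold one by comparing token boundaries. It assumes both tokenizations are already loaded as plain Python lists of strings (parsing the Text Corpus Format files is not shown) and that a multi-word expression counts as a single gold token. The function names `token_spans` and `evaluate` and the sample tokens are hypothetical, and this is not necessarily the metric used in the paper.

```python
def token_spans(tokens):
    """Convert a token list into (start, end) character spans.

    Spaces are ignored so that a multi-word token such as "riswa kirin"
    covers the same characters as the two tokens "riswa" and "kirin".
    """
    spans = set()
    position = 0
    for token in tokens:
        length = len(token.replace(" ", ""))
        spans.add((position, position + length))
        position += length
    return spans


def evaluate(gold_tokens, predicted_tokens):
    """Return (precision, recall, f1) over exactly matching token spans."""
    gold = token_spans(gold_tokens)
    predicted = token_spans(predicted_tokens)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the gold annotation keeps the compound as one token,
# while the predicted tokenization splits it in two.
gold_example = ["riswa kirin", "."]
predicted_example = ["riswa", "kirin", "."]
precision, recall, f1 = evaluate(gold_example, predicted_example)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Comparing character spans rather than token strings keeps the score meaningful when the two tokenizations contain different numbers of tokens.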

Annotated Lexicons

We also provide a set of manually annotated lexicons for this tool, which are continually updated and extended. These lexicons contain Kurdish word lemmata along with hyphen-separated multi-word expressions. The current version contains lexicographic data provided by the FreeDict project and Wîkîferheng, the Kurdish Wiktionary. The transliteration of the Arabic-based script of Kurdish into the Latin-based one is carried out using Wergor. If you would like to take part in enriching the resources, please follow the instructions of the Kurdish Language Processing Toolkit (KLPT).

The following shows two lemmata in the Kurmanji lexicon where the possible writings of a compound word-form are provided in the token_forms field.

"riswa": [],
"riswa-kirin": {
    "token_forms": ["riswakirin", "riswa kirin"]
}
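One way such entries could be consumed is sketched below: each written variant in token_forms is mapped back to its hyphen-separated lemma, so both spellings of a compound resolve to the same entry. The sketch assumes the lexicon is a valid JSON object of the shape shown above. The inline sample and the helper name `build_form_index` are hypothetical and are not part of KLPT or of this repository.

```python
import json

# A tiny lexicon fragment in the shape shown above, wrapped in braces so
# that it parses as a JSON object (an assumption about the file layout).
LEXICON_JSON = """
{
  "riswa": [],
  "riswa-kirin": {
    "token_forms": ["riswakirin", "riswa kirin"]
  }
}
"""


def build_form_index(lexicon):
    """Map every known surface form to its lemma in the lexicon."""
    index = {}
    for lemma, entry in lexicon.items():
        # The lemma itself is always a valid form.
        index[lemma] = lemma
        # Entries without variants (e.g. an empty list) are skipped here.
        if isinstance(entry, dict):
            for form in entry.get("token_forms", []):
                index[form] = lemma
    return index


lexicon = json.loads(LEXICON_JSON)
form_index = build_form_index(lexicon)

print(form_index.get("riswa kirin"))  # riswa-kirin
print(form_index.get("riswakirin"))   # riswa-kirin
```

With such an index, a tokenizer could check whether two adjacent words form a known compound before deciding to emit them as one token.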

For researchers

If you would like to extend the current study, the trained models can be found in the models directory. Please use the corresponding libraries to import the models into your pipelines. The outputs of the models are also available in the experiments folder.

Contribute

Are you interested in this project? Please follow the instructions of the Kurdish Language Processing Toolkit (KLPT) to get involved. Open-source is fun! 😊

Cite this paper

Please consider citing this paper if you use any part of the data or the tool (bib file):

@inproceedings{ahmadi2020tokenization,
  title={{A Tokenization System for the Kurdish Language}},
  author={Ahmadi, Sina},
  booktitle={Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)},
  pages={},
  year={2020}
}

License

The annotated resources by Sina Ahmadi are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:

  • You are free to share, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, even commercially.
  • You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.