Arramooz Dictionary

Data structure description

This file describes the structure of verb and noun entries in multiple formats (CSV, SQL, XML, etc.).

Tables and CSV

Verbs

| Field | Type | Description | وصف |
|---|---|---|---|
| vocalized | String | vocalized word (with diacritics) | الكلمة مشكولة |
| unvocalized | String | unvocalized word (without diacritics) | الكلمة غير مشكولة |
| root | String | root of the verb | جذر الفعل |
| normalized | String | normalized form of the verb (Hamza and Alef forms are unified) | الفعل منمّط، الهمزات والألفات موحّدة |
| stamped | String | normalized verb with all affixation letters removed | بصمة الفعل، حذف كل حروف الزيادة |
| future_type | String | the future mark (vowel of the middle radical in the future tense), used only for triliteral verbs | حركة عين الفعل الثلاثي في المضارع |
| triliteral | Boolean | whether the verb is triliteral (3 letters) | الفعل ثلاثي/غير ثلاثي |
| transitive | Boolean | transitive or intransitive | فعل متعدي/ لازم |
| double_trans | Boolean | doubly transitive (takes two objects) | متعدي لمفعولين |
| think_trans | Boolean | transitive to a human object | متعدي للعاقل |
| unthink_trans | Boolean | transitive to a non-human object | متعدي لغير العاقل |
| reflexive_trans | Boolean | pronominal verb | فعل من أفعال القلوب |
| past | Boolean | can be conjugated in the past tense | يتصرف في الماضي |
| future | Boolean | can be conjugated in the present and future tense | يتصرف في المضارع |
| imperative | Boolean | can be conjugated in the imperative | يتصرف في الأمر |
| passive | Boolean | can be conjugated in the passive voice | يتصرف في المبني للمجهول |
| future_moode | Boolean | can be conjugated in the future moods (jussive, subjunctive) | يتصرف في المضارع المجزوم أو المنصوب |
| confirmed | Boolean | can be conjugated in the confirmed tenses | يتصرف في المؤكد |

The features can be grouped as follows:

  • Word: the vocalized form of the word, with full diacritics, e.g. "ضَرَبَ" (to hit)

  • Basic verb features:

    • root of verb
    • transitive or intransitive
    • triliteral or not: whether the verb lemma has three letters
    • future type: the vowel mark of the middle radical in the future tense; used only for triliteral verbs
  • Features used for search and lookup:

    • Unvocalized: the word without diacritics, e.g. "ضرب"

    • Normalized form: Hamza and Alef variants are unified so that other spellings of the word can be found, e.g. the normalized form of "سأل" is "سءل".

    • Stamped form:

      This feature is used to find the lemma from a stem whose letters may vary.

      The stamp is generated by removing the letters that can act as affixation letters (prefix, infix, suffix), such as ALEF, YEH, WAW, ALEF_MAKSURA, HAMZA, ALEF_HAMZA_ABOVE, WAW_HAMZA, YEH_HAMZA, ALEF_MADDA and SHADDA.

      The following verbs all generate the stamp "كتب", which helps to find similar verbs from an inflected verb:

      • كَتَبَ يَكْتُبُ
      • اِكْتَأَبَ يَكْتَئِبُ
      • اِكْتَبَى يَكْتَبِي
      • كَتَّبَ يُكَتِّبُ
      • كاتَبَ يُكَاتِبُ
      • أَكْتَبَ يُكْتِبُ

      The following verbs all generate the stamp "رم", which helps to find similar verbs from an inflected verb: رَامَ أَرَمَ رَأَمَ رَئِمَ راءَمَ أَرْأَمَ رَمَأَ أَرْمَأَ رَمَّ رَمَّمَ أَرَمَّ رَمَى رامَى أَرْمَى رَوَّمَ رَيَّمَ وَرِمَ وَرَّمَ أَوْرَمَ

  • Verb conjugation features

    • Tenses in which the verb can be conjugated (boolean features): past, future, imperative, passive voice, future moods, confirmed mood
  • Syntactic and semantic affixes:

    Advanced features describing the syntactic affixes that can be attached to the verb:

    • think_trans: the verb accepts an attached pronoun referring to a human, such as (هم، هن، نا، ني)

      For example, the verb "تنفّس" (to breathe) does not accept a human as object.

    • unthink_trans: the verb accepts an attached pronoun referring to a non-human, such as (ها)

      For example, the verb "تنفّس" accepts a non-human as object (تنفّس الغاز, "he breathed the gas").

    • reflexive_trans: the verb accepts a reflexive attached pronoun, such as (نا، ني)

      For example, the verb "ضرب" accepts a reflexive object (ضربت نفسي، ضربتني, "I hit myself").

    • double_trans: the verb is doubly transitive and can take two attached pronouns, as in:

      أعطيتمونيها ("you gave it to me")
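
The three lookup forms described above (unvocalized, normalized, stamped) can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names `unvocalize`, `normalize` and `stamp` are hypothetical, the diacritic range U+064B to U+0652 is assumed to cover the marks used in the dictionary, and the normalization only maps Hamza carriers to a bare Hamza. It is not the dictionary's own generation code.

```python
import re

# Assumption: tanwin, fatha, damma, kasra, shadda and sukun (U+064B..U+0652)
# are the only diacritics that need stripping.
HARAKAT = re.compile("[\u064B-\u0652]")

HAMZA = "\u0621"
# Assumption: unify the Hamza carriers (alef above/below, waw, yeh) into HAMZA.
HAMZA_CARRIERS = str.maketrans({
    "\u0623": HAMZA,  # ALEF_HAMZA_ABOVE
    "\u0625": HAMZA,  # ALEF_HAMZA_BELOW (not listed in the text, assumed)
    "\u0624": HAMZA,  # WAW_HAMZA
    "\u0626": HAMZA,  # YEH_HAMZA
})

# Letters listed in the stamp description: ALEF, YEH, WAW, ALEF_MAKSURA, HAMZA,
# ALEF_HAMZA_ABOVE, WAW_HAMZA, YEH_HAMZA, ALEF_MADDA, SHADDA.
AFFIX_LETTERS = re.compile(
    "[\u0627\u064A\u0648\u0649\u0621\u0623\u0624\u0626\u0622\u0651]"
)


def unvocalize(word):
    """Remove diacritics from a vocalized word."""
    return HARAKAT.sub("", word)


def normalize(word):
    """Strip diacritics and unify Hamza carriers (simplified)."""
    return unvocalize(word).translate(HAMZA_CARRIERS)


def stamp(word):
    """Strip diacritics, then drop every letter that can act as an affix."""
    return AFFIX_LETTERS.sub("", unvocalize(word))


for verb in ["كَتَبَ", "اِكْتَأَبَ", "اِكْتَبَى", "كَتَّبَ", "كاتَبَ", "أَكْتَبَ"]:
    print(verb, stamp(verb))
```

Plain letter removal reproduces the "كتب" examples, but it yields "رمم" for "رَمَّمَ", whereas the list above gives "رم"; the dictionary's actual routine therefore seems to apply at least one further rule that this sketch does not model.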

SQL format of verb

CREATE TABLE verbs ( id int unique, 
	vocalized varchar(30) not null, 
	unvocalized varchar(30) not null, 
	root varchar(30), 
	normalized varchar(30) not null, 
	stamped varchar(30) not null, 
	future_type varchar(5), 
	triliteral tinyint(1) default 0, 
	transitive tinyint(1) default 0, 
	double_trans tinyint(1) default 0, 
	think_trans tinyint(1) default 0, 
	unthink_trans tinyint(1) default 0, 
	reflexive_trans tinyint(1) default 0, 
	past tinyint(1) default 0, 
	future tinyint(1) default 0, 
	imperative tinyint(1) default 0, 
	passive tinyint(1) default 0, 
	future_moode tinyint(1) default 0, 
	confirmed tinyint(1) default 0, 
	PRIMARY KEY (id) )
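
To illustrate how the `stamped` column might be used for lookup, the sketch below loads one row into an in-memory SQLite database and queries it by stamp. The choice of SQLite, the abridged column list, the `find_by_stamp` helper and the `id` value are all assumptions made for the example; the row values are taken from the "ضَرَبَ" entry shown in the XML format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Abridged version of the verbs table: only the columns used in this example.
conn.execute("""
    CREATE TABLE verbs (
        id INTEGER PRIMARY KEY,
        vocalized TEXT NOT NULL,
        unvocalized TEXT NOT NULL,
        root TEXT,
        normalized TEXT NOT NULL,
        stamped TEXT NOT NULL,
        future_type TEXT,
        triliteral INTEGER DEFAULT 0,
        transitive INTEGER DEFAULT 0
    )
""")

# Hypothetical id; the remaining values mirror the documented sample entry.
conn.execute(
    "INSERT INTO verbs (id, vocalized, unvocalized, root, normalized, stamped,"
    " future_type, triliteral, transitive) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (1, "ضَرَبَ", "ضرب", "ضرب", "ضرب", "ضرب", "كسرة", 1, 1),
)


def find_by_stamp(connection, stamp):
    """Return (vocalized, root, transitive) for every verb sharing a stamp."""
    cursor = connection.execute(
        "SELECT vocalized, root, transitive FROM verbs WHERE stamped = ?",
        (stamp,),
    )
    return cursor.fetchall()


matches = find_by_stamp(conn, "ضرب")
print(matches)
```

Because several lemmas can share one stamp, a query like this returns candidates that still need to be filtered against the inflected form.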

XML format

<?xml version='1.0' encoding='utf-8'?>
<dictionary>
<verb future_type='كسرة' 
      triliteral='1' 
      transitive='1'
      double_trans='0'
      think_trans='1'
      unthink_trans='0'
      reflexive_trans='0' >
 <word>ضَرَبَ</word>
 <unvocalized>ضرب</unvocalized>
 <root>ضرب</root>
 <normalized>ضرب</normalized>
 <stamped>ضرب</stamped>
 <tenses past='0'
         future='0'
         imperative='0'
         passive='0' 
         future_moode='0'
         confirmed='0'/>
</verb>
....
</dictionary>
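
A verb entry in this layout can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is one possible reading, not a reference loader: the `parse_verb` helper and the shape of the resulting dictionary are assumptions, and the sample string reproduces the entry above without the XML declaration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dictionary>
<verb future_type='كسرة' triliteral='1' transitive='1' double_trans='0'
      think_trans='1' unthink_trans='0' reflexive_trans='0'>
 <word>ضَرَبَ</word>
 <unvocalized>ضرب</unvocalized>
 <root>ضرب</root>
 <normalized>ضرب</normalized>
 <stamped>ضرب</stamped>
 <tenses past='0' future='0' imperative='0' passive='0'
         future_moode='0' confirmed='0'/>
</verb>
</dictionary>"""


def parse_verb(elem):
    """Turn one <verb> element into a plain dict (hypothetical layout)."""
    # Text-valued children: word, unvocalized, root, normalized, stamped.
    entry = {child.tag: child.text for child in elem if child.tag != "tenses"}
    # future_type is a string; every other attribute is a '0'/'1' flag.
    entry["future_type"] = elem.get("future_type")
    for name, value in elem.attrib.items():
        if name != "future_type":
            entry[name] = value == "1"
    tenses = elem.find("tenses")
    attrs = tenses.attrib if tenses is not None else {}
    entry["tenses"] = {name: value == "1" for name, value in attrs.items()}
    return entry


verbs = [parse_verb(v) for v in ET.fromstring(SAMPLE).iter("verb")]
print(verbs[0]["word"], verbs[0]["transitive"], verbs[0]["tenses"])
```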

Nouns

Database description

| Field | Description | وصف |
|---|---|---|
| vocalized | vocalized word | الكلمة مشكولة |
| unvocalized | unvocalized word | غير مشكولة |
| wordtype | word type (noun of subject, noun of object, …) | نوع الكلمة (اسم فاعل، اسم مفعول، صيغة مبالغة..) |
| root | word root | جذر الكلمة |
| wazn | word pattern or template | وزن الكلمة |
| normalized | normalized form of the noun (Hamza and Alef forms are unified) | الاسم منمّط، الهمزات والألفات موحدة الأشكال |
| stamped | normalized noun with affixation letters removed | بصمة الاسم، حروف الزيادة محذوفة |
| category | word category (subtype) | صنف الكلمة أو قسمها الفرعي |
| original | original verb or noun (masdar) from which the word is derived | مصدر الكلمة فعل او اسم |
| mankous | whether the word is mankous (ends with Yeh) | اسم منقوص |
| defined | whether the word is defined | معرفة |
| gender | the gender of the word | نوع أو جنس الكلمة |
| feminin | the feminine form of the word | مؤنث الكلمة |
| masculin | the masculine form of the word | مذكر الكلمة |
| number | whether the word is singular, dual or plural | عدد مفرد/مثنى/جمع |
| single | the singular form of the word | مفرد الكلمة |
| dualable | accepts the dual suffix | يقبل التثنية |
| feminable | accepts Teh_marbuta | يقبل تاء التأنيث |
| masculin_plural | accepts the masculine plural | يقبل جمع المذكر السالم |
| feminin_plural | accepts the feminine plural | يقبل جمع المؤنث السالم |
| broken_plural | the irregular plural(s), if any | جموع تكسيره إن وجدت |
| mamnou3_sarf | does not accept tanwin | ممنوع من الصرف |
| relative | relative adjective formed with Yeh | منسوب بالياء |
| w_suffix | accepts the Waw suffix (masculine plural in annexation) | يقبل اللاحقة ـو الخاصة بجمع المذكر السالم عند إضافته إلى ما بعده |
| hm_suffix | accepts the Heh+Meem suffix | يقبل اللاحقة ـهم |
| kal_prefix | accepts the Kaf+Alef+Lam prefix | يقبل السابقة كالـ |
| ha_suffix | accepts the Heh suffix | يقبل اللاحقة ـه |
| k_prefix | accepts preposition prefixes without the definite article "AL" | يقبل سابقة الجر دون ال التعريف |
| annex | accepts annexation to the following word, e.g. المقيمي الصلاة | يقبل الإضافة إلى ما بعده مثل المقيمي الصلاة |
| definition | word description | شرح الكلمة |
| note | notes about the dictionary entry | ملاحظات على المدخل في القاموس |

The features can be grouped as follows:

  • Word: the vocalized form of the word, with full diacritics, e.g. "ضَرْبَ" (hit)

  • Basic word features:

    • word type, and category as a subtype
    • root and wazn (pattern)
    • lemma (original)
    • gender: masculine (مذكر) or feminine (مؤنث)
    • number: singular (مفرد), dual (مثنى) or plural (جمع)
    • its singular form, if the word is plural
    • its feminine form, if the word is masculine
    • its irregular plural, if one exists
    • whether the noun is defined (originally defined, like proper nouns)
  • Features used for search and lookup:

    • Unvocalized: the word without diacritics, e.g. "ضرب"

    • Normalized form: Hamza and Alef variants are unified so that other spellings of the word can be found, e.g. the normalized form of "سؤال" is "سءال".

    • Stamped form:

      This feature is used to find the lemma from a stem whose letters may vary.

      The stamp is generated by removing the letters that can act as affixation letters (prefix, infix, suffix), such as ALEF, YEH, WAW, ALEF_MAKSURA, HAMZA, ALEF_HAMZA_ABOVE, WAW_HAMZA, YEH_HAMZA, ALEF_MADDA and SHADDA.

  • Noun inflection:

    A noun can accept affixes or cases such as:

    • dualable: accepts the dual suffix يقبل التثنية
    • feminable: accepts Teh_marbuta يقبل تاء التأنيث
    • masculin_plural: accepts the masculine plural يقبل جمع المذكر السالم
    • feminin_plural: accepts the feminine plural يقبل جمع المؤنث السالم
    • mamnou3_sarf: does not accept tanwin ممنوع من الصرف
    • w_suffix: accepts the Waw suffix يقبل اللاحقة ـو الخاصة بجمع المذكر السالم عند إضافته إلى ما بعده
    • hm_suffix: accepts the Heh+Meem suffix يقبل اللاحقة ـهم
    • kal_prefix: accepts the Kaf+Alef+Lam prefix يقبل السابقة كالـ
    • ha_suffix: accepts the Heh suffix يقبل اللاحقة ـه
    • k_prefix: accepts preposition prefixes without the definite article "AL" يقبل سابقة الجر دون ال التعريف
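
As a rough illustration of how the boolean flags above could gate affixation, the sketch below lists candidate surface forms for a stem. The `candidate_forms` helper and the flag-to-affix mapping are hypothetical; the affix strings are the ones named in the field descriptions (Teh_marbuta, Waw, Heh+Meem, Heh, Kaf+Alef+Lam). Plain concatenation ignores the orthographic adjustments that real Arabic morphology requires, and flags whose affix is not spelled out in the descriptions (for example `dualable`) are left out.

```python
# Hypothetical mapping from a boolean noun flag to the affix it permits.
SUFFIX_FLAGS = {
    "feminable": "\u0629",        # Teh_marbuta
    "w_suffix": "\u0648",         # Waw
    "hm_suffix": "\u0647\u0645",  # Heh + Meem
    "ha_suffix": "\u0647",        # Heh
}
PREFIX_FLAGS = {
    "kal_prefix": "\u0643\u0627\u0644",  # Kaf + Alef + Lam
}


def candidate_forms(stem, flags):
    """Return the stem plus every naively affixed form its flags allow."""
    forms = [stem]
    for name, suffix in SUFFIX_FLAGS.items():
        if flags.get(name):
            forms.append(stem + suffix)
    for name, prefix in PREFIX_FLAGS.items():
        if flags.get(name):
            forms.append(prefix + stem)
    return forms


# Flags loosely based on a noun that accepts Teh_marbuta but not Heh+Meem.
print(candidate_forms("بار", {"feminable": 1, "hm_suffix": 0}))
```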

SQL format of noun

CREATE TABLE  IF NOT EXISTS `nouns` (
          `id` int(11) unique,
          `vocalized` varchar(30) DEFAULT NULL,
          `unvocalized` varchar(30) DEFAULT NULL,
          `normalized` varchar(30) DEFAULT NULL,
          `stamp` varchar(30) DEFAULT NULL,
          `wordtype` varchar(30) DEFAULT NULL,
          `root` varchar(10) DEFAULT NULL,
          `wazn` varchar(30) DEFAULT NULL,
          `category` varchar(30) DEFAULT NULL,
          `original` varchar(30) DEFAULT NULL,
          `gender` varchar(30) DEFAULT NULL,
          `feminin` varchar(30) DEFAULT NULL,
          `masculin` varchar(30) DEFAULT NULL,
          `number` varchar(30) DEFAULT NULL,
          `single` varchar(30) DEFAULT NULL,
          `broken_plural` varchar(30) DEFAULT NULL,            
          `defined` tinyint(1) DEFAULT 0,
          `mankous` tinyint(1) DEFAULT 0,
          `feminable` tinyint(1) DEFAULT 0,
          `dualable` tinyint(1) DEFAULT 0,
          `masculin_plural` tinyint(1) DEFAULT 0,
          `feminin_plural` tinyint(1) DEFAULT 0,
          `mamnou3_sarf` tinyint(1) DEFAULT 0,
          `relative` tinyint(1) DEFAULT 0,
          `w_suffix` tinyint(1) DEFAULT 0,
          `hm_suffix` tinyint(1) DEFAULT 0,
          `kal_prefix` tinyint(1) DEFAULT 0,
          `ha_suffix` tinyint(1) DEFAULT 0,
          `k_prefix` tinyint(1) DEFAULT 0,
          `annex` tinyint(1) DEFAULT 0,
          `definition` text,
          `note` text
        ) ;

XML format

<dictionary>
<noun id='60000'>
 <vocalized>بَارٌّ</vocalized>
 <unvocalized>بار</unvocalized>
 <normalized>بار</normalized>
 <stamp>بر</stamp>
 <wordtype>اسم فاعل</wordtype>
 <root>برر</root>
 <wazn/>
 <category/>
 <original/>
 <gender>مذكر</gender>
 <feminin/>
 <masculin/>
 <number>مفرد</number>
 <single/>
 <broken_plural>+ون;+ات;أَبْرَارٌ;بَرَرَةٌ</broken_plural>
 <defined/>
 <mankous/>
 <feminable>1</feminable>
 <dualable>1</dualable>
 <masculin_plural>1</masculin_plural>
 <feminin_plural>1</feminin_plural>
 <mamnou3_sarf/>
 <relative/>
 <w_suffix/>
 <hm_suffix/>
 <kal_prefix/>
 <ha_suffix/>
 <k_prefix/>
 <annex/>
 <definition>". ""تَرَكَ ابْناً بَارّاً"" : صَادِقاً وَصَالِحاً وَمُحْسِناً. ""اِبْنُكَ البارُّ يُحِبُّكَ"</definition>
 <note/>
</noun>
...

</dictionary>
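
The noun layout differs from the verb layout in that every feature is a child element, and an empty element stands for an unset value. The sketch below shows one way this could be read in Python. The `parse_noun` helper, the abridged sample and the subset of flag fields are assumptions for the example; splitting `broken_plural` on ";" follows the sample entry, and the meaning of items starting with "+" is not specified here, so they are kept as-is.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample entry above (only some fields are kept).
SAMPLE = """<dictionary>
<noun id='60000'>
 <vocalized>بَارٌّ</vocalized>
 <unvocalized>بار</unvocalized>
 <stamp>بر</stamp>
 <root>برر</root>
 <wazn/>
 <gender>مذكر</gender>
 <number>مفرد</number>
 <broken_plural>+ون;+ات;أَبْرَارٌ;بَرَرَةٌ</broken_plural>
 <feminable>1</feminable>
 <dualable>1</dualable>
 <mamnou3_sarf/>
</noun>
</dictionary>"""

# Subset of the boolean fields, enough for this example.
FLAG_FIELDS = {"feminable", "dualable", "mamnou3_sarf"}


def parse_noun(elem):
    """Turn one <noun> element into a plain dict (hypothetical layout)."""
    entry = {"id": int(elem.get("id"))}
    for child in elem:
        text = (child.text or "").strip()
        if child.tag in FLAG_FIELDS:
            entry[child.tag] = text == "1"       # empty element means False
        elif child.tag == "broken_plural":
            entry[child.tag] = [p for p in text.split(";") if p]
        else:
            entry[child.tag] = text or None      # empty element means unset
    return entry


noun = parse_noun(ET.fromstring(SAMPLE).find("noun"))
print(noun["unvocalized"], noun["broken_plural"])
```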