HengamTagger or Parstdex (Persian time date extractor)

Description

Parstdex (known as HengamTagger in our AACL paper) is a rule-based Persian temporal extractor built on regular expressions. It defines pattern units and combines them into patterns that match temporal expressions.
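To make the idea of pattern units and patterns concrete, here is a toy sketch in English. It is not Parstdex's actual implementation or pattern files: the unit names (`NUM`, `UNIT`, `DIR`), the `UNITS` table, and the `compile_pattern` helper are all hypothetical and only illustrate how named regex fragments can be combined into a larger pattern.

```python
import re

# Hypothetical pattern units: each name maps to a small regex fragment.
UNITS = {
    "NUM": r"\d+",
    "UNIT": r"(?:day|week|month|year)s?",
    "DIR": r"(?:ago|later)",
}


def compile_pattern(pattern: str) -> re.Pattern:
    """Expand a space-separated pattern of unit names into one regex.

    Tokens that are not unit names are matched literally.
    """
    parts = []
    for token in pattern.split():
        if token in UNITS:
            parts.append(f"(?:{UNITS[token]})")
        else:
            parts.append(re.escape(token))
    # Units in a pattern are separated by whitespace in the text.
    return re.compile(r"\s+".join(parts))


rx = compile_pattern("NUM UNIT DIR")
match = rx.search("She called 3 days ago.")
print(match.span(), match.group())  # (11, 21) 3 days ago
```

Under this scheme, adding support for a new expression means adding a unit or a pattern string rather than writing a new regex by hand.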

How to install

pip install parstdex

How to use

from parstdex import Parstdex

model = Parstdex()

sentence = """ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"""

Extract spans

model.extract_span(sentence)

output :

{"datetime": [[6, 47], [68, 78], [82, 111]], "date": [[6, 10], [68, 78], [82, 111]], "time": [[11, 47]]}
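The spans above can be turned back into text by slicing the sentence. The sketch below hard-codes the sentence and the output shown above so that it runs without Parstdex installed. It assumes each pair is a half-open `[start, end)` character range into the input string, which is what this example suggests; `slice_spans` is a helper name introduced here, not part of the library.

```python
# Sentence and spans copied from the example above.
sentence = "ماریا شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به نادیا زنگ زد اما تا سه روز بعد در تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش. خبری از نادیا نشد"
spans = {
    "datetime": [[6, 47], [68, 78], [82, 111]],
    "date": [[6, 10], [68, 78], [82, 111]],
    "time": [[11, 47]],
}


def slice_spans(text, span_list):
    """Return the substring for each [start, end) pair (assumed half-open)."""
    return [text[start:end] for start, end in span_list]


for label, span_list in spans.items():
    print(label, slice_spans(sentence, span_list))
```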

Extract markers

model.extract_marker(sentence)

output :

{
   "datetime":{
      "[6, 47]":"شنبه عصر راس ساعت ۱۷ و بیست و سه دقیقه به",
      "[68, 78]":"سه روز بعد",
      "[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
   },
   "date":{
      "[6, 10]":"شنبه",
      "[68, 78]":"سه روز بعد",
      "[82, 111]":"تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش."
   },
   "time":{
      "[11, 47]":"عصر راس ساعت ۱۷ و بیست و سه دقیقه به"
   }
}
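If the result has the shape shown above, with span keys stored as strings such as `"[6, 10]"`, it can be flattened into sortable rows. This is a minimal sketch on a hard-coded subset of that output; `flatten_markers` is a hypothetical helper, and the string-key format is an assumption taken from the printed example.

```python
import json

# A subset of the marker output shown above.
markers = {
    "date": {"[6, 10]": "شنبه", "[68, 78]": "سه روز بعد"},
    "time": {"[11, 47]": "عصر راس ساعت ۱۷ و بیست و سه دقیقه به"},
}


def flatten_markers(marker_dict):
    """Turn {label: {"[start, end]": text}} into sorted (start, end, label, text) rows."""
    rows = []
    for label, entries in marker_dict.items():
        for key, text in entries.items():
            start, end = json.loads(key)  # "[6, 10]" -> [6, 10]
            rows.append((start, end, label, text))
    return sorted(rows)


for row in flatten_markers(markers):
    print(row)
```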

Extract TimeML scheme

model.extract_time_ml(sentence)

output :

ماریا 
<TIMEX3 type='DATE'>
شنبه
</TIMEX3>
<TIMEX3 type='TIME'>
عصر راس ساعت ۱۷ و بیست و سه دقیقه به
</TIMEX3>
 نادیا زنگ زد اما 
<TIMEX3 type='DURATION'>
تا سه روز بعد
</TIMEX3>
 در 
<TIMEX3 type='DATE'>
تاریخ ۱۸ شهریور سال ۱۳۷۸ ه.ش.
</TIMEX3>
خبری از نادیا نشد
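The tagged text above can be read back into (type, expression) pairs with a regular expression. The sketch below assumes the tags look exactly as printed, with a single-quoted `type` attribute and no nesting; `parse_timex3` is a helper name introduced here, and the sample string is a shortened copy of the output.

```python
import re

# Shortened copy of the TimeML output shown above.
time_ml = (
    "ماریا \n<TIMEX3 type='DATE'>\nشنبه\n</TIMEX3>\n"
    " نادیا زنگ زد اما \n<TIMEX3 type='DURATION'>\nتا سه روز بعد\n</TIMEX3>\n"
)

# Capture the type attribute and the tag body, trimming surrounding whitespace.
TIMEX3 = re.compile(r"<TIMEX3 type='([A-Z]+)'>\s*(.*?)\s*</TIMEX3>", re.DOTALL)


def parse_timex3(text):
    """Return a list of (type, expression) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in TIMEX3.finditer(text)]


print(parse_timex3(time_ml))
```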

Extract markers' NER tags

DATTIM mode (Default):

model.extract_ner(sentence, mode="dattim")

output :

[
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("ساعت", "I-TIM"),
    ("۱۷", "I-TIM"),
    ("و", "I-TIM"),
    ("بیست", "I-TIM"),
    ("و", "I-TIM"),
    ("سه", "I-TIM"),
    ("دقیقه", "I-TIM"),
    ("به", "I-TIM"),
    ("نادیا", "O"),
    ("زنگ", "O"),
    ("زد", "O"),
    ("اما", "O"),
    ("تا", "B-DAT"),
    ("سه", "I-DAT"),
    ("روز", "I-DAT"),
    ("بعد", "I-DAT"),
    ("در", "I-DAT"),
    ("تاریخ", "I-DAT"),
    ("۱۸", "I-DAT"),
    ("شهریور", "I-DAT"),
    ("سال", "I-DAT"),
    ("۱۳۷۸", "I-DAT"),
    ("ه", "I-DAT"),
    (".", "I-DAT"),
    ("ش", "I-DAT"),
    (".", "I-DAT"),
    ("خبری", "O"),
    ("از", "O"),
    ("نادیا", "O"),
    ("نشد", "O"),
]
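A list of (token, tag) pairs like the one above can be collapsed into whole entities by following the B-/I- prefixes. This is a generic BIO grouping sketch, not a Parstdex function; `group_bio` is a hypothetical helper. Note that joining tokens with spaces does not always reproduce the original text, for example around the dots in "ه.ش.".

```python
def group_bio(tagged):
    """Group (token, tag) pairs in BIO format into (label, text) entities."""
    entities = []
    label, tokens = None, []

    def flush():
        # Close the entity currently being built, if any.
        if tokens:
            entities.append((label, " ".join(tokens)))

    for token, tag in tagged:
        if tag.startswith("I-") and tag[2:] == label:
            tokens.append(token)  # continue the current entity
        elif tag.startswith(("B-", "I-")):
            flush()  # a new entity starts (a stray I- is treated as a start)
            label, tokens = tag[2:], [token]
        else:
            flush()  # "O" ends any open entity
            label, tokens = None, []
    flush()
    return entities


sample = [
    ("ماریا", "O"),
    ("شنبه", "B-DAT"),
    ("عصر", "B-TIM"),
    ("راس", "I-TIM"),
    ("نادیا", "O"),
]
print(group_bio(sample))
```

The same function works for the TMP mode, since only the label after the prefix changes.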

TMP mode:

model.extract_ner(sentence, mode="tmp")

output :

[
    ("ماریا", "O"),
    ("شنبه", "B-TMP"),
    ("عصر", "I-TMP"),
    ("راس", "I-TMP"),
    ("ساعت", "I-TMP"),
    ("۱۷", "I-TMP"),
    ("و", "I-TMP"),
    ("بیست", "I-TMP"),
    ("و", "I-TMP"),
    ("سه", "I-TMP"),
    ("دقیقه", "I-TMP"),
    ("به", "I-TMP"),
    ("نادیا", "O"),
    ("زنگ", "O"),
    ("زد", "O"),
    ("اما", "O"),
    ("تا", "B-TMP"),
    ("سه", "I-TMP"),
    ("روز", "I-TMP"),
    ("بعد", "I-TMP"),
    ("در", "I-TMP"),
    ("تاریخ", "I-TMP"),
    ("۱۸", "I-TMP"),
    ("شهریور", "I-TMP"),
    ("سال", "I-TMP"),
    ("۱۳۷۸", "I-TMP"),
    ("ه", "I-TMP"),
    (".", "I-TMP"),
    ("ش", "I-TMP"),
    (".", "I-TMP"),
    ("خبری", "O"),
    ("از", "O"),
    ("نادیا", "O"),
    ("نشد", "O"),
]

File Structure:

Parstdex's architecture is flexible and scalable, which makes it easy to add patterns that are not yet covered.

├── parstdex
│   ├── utils
│   │   ├── annotation
│   │   │   └── ...
│   │   ├── pattern
│   │   │   └── ...
│   │   ├── special_words
│   │   │   └── words.txt
│   │   ├── const.py
│   │   ├── normalizer.py
│   │   ├── pattern_to_regex.py
│   │   ├── deprecation.py
│   │   ├── regex_tool.py
│   │   ├── spans.py
│   │   └── tokenizer.py
│   ├── marker_extractor.py
│   └── settings.py
├── Test
│   ├── data.json
│   └── test_parstdex.py
├── examples.py
├── performance_test.ipynb
├── requirement.txt
└── setup.py

Performance Test

The executable code and the performance test results are available on Google Colab.

The average time required to extract temporal expressions is 6 ms. This test was run on 264 sentences, with an average length of 50 characters, that together cover all of the patterns.
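An average latency of this kind can be measured with a small timing loop. The sketch below is an assumption about how such a measurement could be done, not necessarily the procedure used in the notebook. `average_latency_ms` and `dummy_extract` are hypothetical names; with Parstdex installed, `dummy_extract` could be replaced by `Parstdex().extract_span`.

```python
import time


def average_latency_ms(extract, sentences, repeats=3):
    """Average wall-clock time per call to extract(), in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for sentence in sentences:
            extract(sentence)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / (repeats * len(sentences))


def dummy_extract(sentence):
    # Stand-in for a real extractor so the sketch runs on its own.
    return {"datetime": [], "date": [], "time": []}


print(f"{average_latency_ms(dummy_extract, ['a', 'b', 'c']):.4f} ms per sentence")
```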

How to contribute

Feedback and suggestions are welcome. For more information on how to contribute to Parstdex, see the contribution document.

Citation

If you use any part of this repository in your research, please cite it using the following BibTeX entry.

@inproceedings{mirzababaei-etal-2022-hengam,
	title        = {Hengam: An Adversarially Trained Transformer for {P}ersian Temporal Tagging},
	author       = {Mirzababaei, Sajad  and Kargaran, Amir Hossein  and Sch{\"u}tze, Hinrich  and Asgari, Ehsaneddin},
	year         = 2022,
	booktitle    = {Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing},
	publisher    = {Association for Computational Linguistics},
	address      = {Online only},
	pages        = {1013--1024},
	url          = {https://aclanthology.org/2022.aacl-main.74}
}
