Farsi-Verb-Tokenizer

Tokenizes Farsi Verbs.

Usage:

perl farsi-verb-tokenizer.perl [-d resource-directory] input-file [pattern_id]

resource-directory: The directory where the resource files are located. Defaults to "./resource".

input-file: A list of sentences to be tokenized.

pattern_id: The id of the transformation to apply: an integer from 0 to N - 1 inclusive, where N is the number of transformations (that is, the number of lines in "transformations.tsv").

output: The list of tokenized sentences, written to standard output.
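
To run one transformation at a time, a wrapper needs the range of valid pattern_id values. The shell sketch below derives that range from the line count of a transformations file. `pattern_ids` is a hypothetical helper, not part of this repository, and it assumes one transformation per line with no header row.

```shell
# Hypothetical helper: print every valid pattern_id (0 .. N-1) for the
# transformations file given as $1. Assumes one transformation per line.
pattern_ids() {
    # grep -c '' counts lines, including a last line with no trailing newline
    n=$(grep -c '' "$1")
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "$i"
        i=$((i + 1))
    done
}

# Example (not run here): write one output file per transformation.
# for id in $(pattern_ids resource/transformations.tsv); do
#     perl ./script/farsi-verb-tokenizer.perl test-input.txt "$id" > "out-$id.txt"
# done
```

Deriving the range from the file avoids hard-coding a count that would go stale if transformations are added or removed.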

If no pattern_id is passed, all transformations are applied to each sentence, one after another. The transformations in "resource/transformations.tsv" are sorted from largest to smallest, so once a larger transformation matches (and hence transforms) a substring of a sentence, smaller transformations can no longer affect that substring.
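
The effect of this ordering can be illustrated with a toy example. The sketch below is not the tokenizer's implementation and does not use the real format of "resource/transformations.tsv". `apply_rules` and the two placeholder rules are assumptions made only for illustration. It applies sed substitutions in the order given, so the result depends on whether the longer rule runs first.

```shell
# Toy illustration only: apply "pattern=replacement" rules in the order given.
# Patterns are handed to sed as basic regular expressions, so in this sketch
# they must not contain "/" or "=".
apply_rules() {
    text=$1
    shift
    for rule in "$@"; do
        pat=${rule%%=*}   # text before the first "="
        rep=${rule#*=}    # text after the first "="
        text=$(printf '%s\n' "$text" | sed "s/$pat/$rep/g")
    done
    printf '%s\n' "$text"
}

# Longer rule first: the whole span is rewritten, so the shorter rule no longer matches.
apply_rules 'a b c' 'a b c=a_b_c' 'a b=a_b'    # prints a_b_c

# Shorter rule first: it rewrites "a b", so the longer rule can never match.
apply_rules 'a b c' 'a b=a_b' 'a b c=a_b_c'    # prints a_b c
```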

Example:

perl ./script/farsi-verb-tokenizer.perl test-input.txt > test-output.txt

The output should be identical to test-tokenized.txt. You can check this by running the following diff command, which prints nothing when the files match:

diff test-output.txt test-tokenized.txt
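
For use in a script, the same check can be wrapped so that it reports a status instead of a raw diff. This is a minimal sketch: `check_output` is a hypothetical helper name, and it relies only on diff exiting with status 0 when the two files are identical.

```shell
# Hypothetical helper: compare an actual output file ($1) with the expected
# file ($2). Prints PASS and returns 0 if they are identical; otherwise
# prints FAIL and returns 1.
check_output() {
    if diff "$1" "$2" > /dev/null; then
        echo "PASS"
    else
        echo "FAIL: $1 differs from $2"
        return 1
    fi
}

# Example (not run here):
# perl ./script/farsi-verb-tokenizer.perl test-input.txt > test-output.txt
# check_output test-output.txt test-tokenized.txt
```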

mehdi-manshadi/Farsi-Verb-Tokenizer