Slight modification in Bodhi for incorporating a few unique characters in Drenjongke #54

Open
bloodgroup-cplusplus opened this issue Nov 2, 2023 · 0 comments


All the training data present in Bodhi and Dzongkha very much applies to Drenjongke, with the exception of the two points mentioned below:
1) Since the size of our corpus is not large, we could have typed all the data, but we opted for using the OCR method instead. Testing the OCR method was beneficial because we found that the OCR-ed texts contained errors due to the “tsha-lag” ◌༹ marker, which is used to mark the pronunciation of [bj] in Drenjongke. The use of this marker is unique to Drenjongke because Tibetan (Bodhi) does not have the sound [bj].
2) For tokenization, space was set as a delimiter. Drenjongke script is marked by a syllable marker called “tsheg” ་, and has a space between potential morpheme or word boundaries. The use of space in the orthography is specific to Drenjongke, as other Tibetan languages do not utilize spacing in a sentence.
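
To make the two points concrete, here is a minimal Python sketch of what they could look like in practice. It is only an illustration, not part of Tesseract: the function names `tokenize` and `count_tsha_lag` are hypothetical, and it assumes the tsheg is U+0F0B and the tsha-lag mark shown above is U+0F39. The sample string is just a few arbitrary Tibetan-script syllables, not a real Drenjongke sentence.

```python
TSHEG = "\u0f0b"     # syllable marker (assumed code point for the tsheg)
TSHA_LAG = "\u0f39"  # combining mark (assumed code point for the tsha-lag)


def tokenize(sentence):
    """Split on whitespace into tokens, then split each token on tsheg.

    Returns a list of tokens, each token being a list of syllables.
    """
    tokens = sentence.split()
    return [[syl for syl in tok.split(TSHEG) if syl] for tok in tokens]


def count_tsha_lag(text):
    """Count how many times the tsha-lag mark occurs in a text."""
    return text.count(TSHA_LAG)


if __name__ == "__main__":
    # Arbitrary sample: two space-separated tokens, syllables joined by tsheg.
    sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42 \u0f66\u0f90\u0f51\u0f0b"
    print(tokenize(sample))
    print(count_tsha_lag("\u0f56\u0f39"))
```

Comparing `count_tsha_lag` on a hand-typed page and on the OCR output of the same page would be one simple way to see how often the mark is being lost.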

Since these two are minor issues, we decided not to train Drenjongke from scratch and instead to start by adding the required character to solve problem 1.
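
As a rough way to confirm which characters actually need to be added, a small Python sketch like the following could compare our corpus against the characters an existing model already covers. The function `missing_characters` and the `known_chars` argument are hypothetical; how to obtain the character list of the existing Bodhi model is not shown here and is assumed to be available separately.

```python
def missing_characters(corpus_text, known_chars):
    """Return the sorted characters used in the corpus but absent from known_chars.

    Whitespace is ignored because it is a delimiter, not a glyph to recognize.
    """
    needed = {ch for ch in corpus_text if not ch.isspace()}
    return sorted(needed - set(known_chars))


if __name__ == "__main__":
    # Hypothetical inputs: a tiny corpus and a tiny "known" character list.
    corpus = "\u0f56\u0f39 \u0f0b"
    known = "\u0f56\u0f0b"
    for ch in missing_characters(corpus, known):
        print(f"U+{ord(ch):04X}")
```

If the only code point reported for our corpus is the tsha-lag mark, that would support adding a single character rather than retraining everything.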
Also, on the previous issue, amitdo (https://github.com/amitdo) mentioned that training should be done from our side. Since our expertise lies in language and not in programming, our understanding is that we should clone the entire tesseract-ocr repo locally, make the changes, and then train it. Is that right, or is it done some other way? Any help would be highly appreciated.
