wrong default mapping of some Romanian diacritics #37

latrau · 2018-02-10T08:01:25Z

Environment

Debian Linux

Tesseract Version: tesseract 4.00.00alpha
Platform: Linux 4.15.0 SMP PREEMPT 2018 x86_64 GNU/Linux

Current Behavior:

Using the ron (Romanian) language option:

The Romanian diacritics șȘțȚ are mapped to the wrong Unicode code points, namely:
Ș -> Ş=U+015E
ș -> ş=U+015F
Ț -> Ţ=U+0162
ț -> ţ=U+0163

Expected Behavior:

Ș -> Ș=U+0218
ș -> ș=U+0219
Ț -> Ț=U+021A
ț -> ț=U+021B

Suggested Fix:

edit the map accordingly;
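
Until the model data is corrected, the same mapping can be applied to the OCR output as a post-processing step. This is only a minimal sketch in Python, assuming the recognized text is already available as a string; `CEDILLA_TO_COMMA` and `fix_romanian` are names made up for illustration and are not part of Tesseract.

```python
# Translation table from the cedilla code points to the comma-below
# code points expected for Romanian (hypothetical helper, not a Tesseract API).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})


def fix_romanian(text: str) -> str:
    """Return text with s/t-cedilla replaced by s/t-comma-below."""
    return text.translate(CEDILLA_TO_COMMA)
```

This should only be applied to Romanian output: Turkish text legitimately uses U+015E and U+015F, so a blanket replacement would corrupt it.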

zdenop · 2018-02-10T08:38:22Z

Where is an input image or something else that would demonstrate the problem?

latrau · 2018-02-10T21:28:47Z

The Romanian typographical convention is that the diacritic forms of s and t take a comma below, not a cedilla (both forms are encoded in Unicode: the cedilla forms in Latin Extended-A, the comma-below forms in Latin Extended-B).

The best solution would be for any diacritic s or t recognized with the ron (Romanian) option to be mapped to the Latin Extended-B code points above, meaning that Tesseract's ron unicharset should contain no trace of [15e ] [15f ] [162 ] or [163 ], only [218 ]-[21b ].

e.g.

the wrong mapping is everywhere once the -ron option is selected...

Let me quote the Unicode Standard, Version 10.0 (chapter 7) on this:

The Unicode Standard provides unambiguous representations for all of the forms, for example, U+0219 ș latin small letter s with comma below versus U+015F ş latin small letter s with cedilla. In modern usage, the preferred representation of Romanian text is with U+0219 ș latin small letter s with comma below, while Turkish data is represented with U+015F ş latin small letter s with cedilla.

same goes for ȘțȚ.

So the ron option means șțȚȘ [U+0218-U+021B] with no ambiguity and should nowhere involve şŞŢţ [U+015E-F][U+0162-3].
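
To check whether a given piece of text (OCR output, or the contents of a training file) still contains the cedilla forms, something like the following could be used. It is a rough sketch using only Python's standard `unicodedata` module; `CEDILLA_FORMS` and `find_cedilla_forms` are hypothetical names chosen for this example.

```python
import unicodedata

# The four code points that should not appear in Romanian text.
CEDILLA_FORMS = "\u015E\u015F\u0162\u0163"


def find_cedilla_forms(text):
    """Return (index, code point, Unicode name) for every s/t-cedilla in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in CEDILLA_FORMS
    ]
```

An empty result means the text uses only the comma-below forms (or no diacritic s/t at all).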

amitdo · 2020-05-12T20:13:39Z

This issue is not caused by Tesseract itself. It should be moved to another repo (not sure which one).

stweil · 2020-05-13T06:28:20Z

I think langdata_lstm is a good one, so I will transfer the issue there.

stweil · 2020-05-13T06:33:52Z

@latrau, so each of the wrong characters should be replaced? Do you want to send a pull request which fixes ron.training_text, maybe also ron.singles_text and ron.wordlist?
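
A replacement across those files could be scripted roughly as follows. This is a sketch, not a tested patch for langdata_lstm: it assumes the three files are UTF-8 encoded and sit together in one directory, and the function names `convert_file` and `convert_langdata` are invented here. It replaces every cedilla form unconditionally.

```python
from pathlib import Path

# Cedilla code points -> comma-below code points, in matching order.
CEDILLA_TO_COMMA = str.maketrans(
    "\u015E\u015F\u0162\u0163",
    "\u0218\u0219\u021A\u021B",
)


def convert_file(path: Path) -> int:
    """Rewrite one file in place; return how many characters were replaced."""
    # newline="" keeps the file's original line endings untouched.
    with open(path, encoding="utf-8", newline="") as f:
        original = f.read()
    converted = original.translate(CEDILLA_TO_COMMA)
    # The mapping is one-to-one, so both strings have the same length.
    changed = sum(a != b for a, b in zip(original, converted))
    if changed:
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(converted)
    return changed


def convert_langdata(directory: Path) -> dict:
    """Convert the Romanian text files found in directory (assumed layout)."""
    names = ("ron.training_text", "ron.singles_text", "ron.wordlist")
    return {
        name: convert_file(directory / name)
        for name in names
        if (directory / name).exists()
    }
```

Calling `convert_langdata(Path("ron"))` would return a per-file count of replaced characters, which makes it easy to review the size of the change before sending a pull request.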

stweil · 2020-05-13T06:37:02Z

@latrau, was cedilla used in historic Romanian texts? If yes, it might be a good idea to keep both forms (with cedilla for the historic characters and with comma for the modern ones).

stweil transferred this issue from tesseract-ocr/tesseract May 13, 2020
