
Add Javanese Script for jav-java #126

Open
Shreeshrii opened this issue Apr 23, 2018 · 55 comments

@Shreeshrii
Contributor

Originally posted in the forum:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/8r8YOQgTBT4/xHpCTp9DAwAJ

From: Christopher Imantaka Halim

> Hi,
> 
> I want to develop an OCR for Javanese Script / Aksara.
> https://en.wikipedia.org/wiki/Javanese_script
> 
> Plan on using Tesseract version 4.0
> I've read the wiki but somehow got confused.
> 
> What do I need to prepare, to start the bare minimum training process? (for Tesseract 4.0)
> In some other thread someone said that training using image files is not supported yet.
> Also found out that box file/tiff pairs are not supported either.
> (I did try making one box file, using this online tool: https://pp19dd.com/tesseract-ocr-chopper/)
> 
> Do we have an example of the training "inputs" somewhere on the github projects?
> 
> Sorry if this is a stupid question, I'm a newbie. :)
> 
> Thanks in advance

@Shreeshrii
Contributor Author

  1. Collect training text in Javanese script (Unicode). You will need a large number of lines, around 500,000, to train from scratch. Alternatively, if you can identify a language/script currently supported by Tesseract that is similar, you can train by replacing a layer; in that case, try to get representative training text of about 50,000 lines with 50 words each.

  2. Collect Unicode fonts that can correctly render the above text. The more fonts you have, the better.

  3. Collect word frequency lists in Javanese script.

  4. Preferably use the Linux platform for training.

  5. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#training-just-a-few-layers

You can try this as it will be faster than training from scratch.

Please post links to Javanese script related resources below.

If there is a transliterator that converts Javanese written in Latin script to Javanese script, it can be used to convert the existing files for lang jav as a start.
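
For reference, the "train just a few layers" route in item 5 amounts to cutting the top layer(s) off an existing model for a similar script and retraining them on the new data. A minimal sketch, assuming khm from tessdata_best as the base and a starter jav_java traineddata made with tesstrain.sh; all paths, the net spec size and the iteration count are placeholders to adapt:

combine_tessdata -e tessdata_best/khm.traineddata khm.lstm    # extract the LSTM network from the base model
lstmtraining \
  --continue_from khm.lstm \
  --traineddata jav_java/jav_java.traineddata \
  --append_index 5 --net_spec '[Lfx256 O1c1]' \
  --model_output output/jav_java \
  --train_listfile jav_java.training_files.txt \
  --max_iterations 10000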

@Shreeshrii
Contributor Author

Do we have an example of the training "inputs" somewhere on the github projects?

See

https://github.com/tesseract-ocr/langdata/tree/master/jav

https://github.com/tesseract-ocr/langdata/blob/master/README.md

@amitdo

amitdo commented Apr 23, 2018

Before training, he should try the best/fast jav.traineddata.

@Shreeshrii
Contributor Author

jav is the Javanese language in Latin script:

zat kasebut lan kanthi Kategori:Tokoh ing user:OffsBlink para pedunung PL09Puryono| kaya désa
2006 90%; sisih wiwit dan papan wilayah Delengen 5 || ! Wétan, Cathetan € sawijining | saged
amarga Cathetan jaba saka Dominique jiwa. ingkang User:ZorroIII Indonesia 1] langkung NGC

He wants it in Javanese script.

The Javanese script, natively known as Aksara Jawa (ꦲꦏ꧀ꦱꦫꦗꦮ, aksara jawa) and Hanacaraka (ꦲꦤꦕꦫꦏ, hanacaraka), is an abugida developed by the Javanese people to write several Austronesian languages spoken in Indonesia, primarily the Javanese language and an early form of Javanese called Kawi, as well as Sanskrit, an Indo-Aryan language used as a sacred language throughout Asia. The Javanese script is a descendant of the Brahmi script and therefore has many similarities with the modern scripts of South India and Southeast Asia. The Javanese script, along with the Balinese script, is considered the most elaborate and ornate among Brahmic scripts of Southeast Asia.[1]

This might be similar to Thai/Khmer - could try using that to train from.

@Shreeshrii
Contributor Author

http://unicode.org/udhr/d/udhr_jav_java.html

Universal Declaration of Human Rights - Javanese (Javanese)

@Shreeshrii
Contributor Author

https://jv.wikipedia.org/wiki/Parembugan:Joko_Widodo

Most of Javanese wikipedia seems to be in Latin script.

@Shreeshrii Shreeshrii changed the title Creating a new language pack for Javanese Script Add Javanese Script for jav-java Apr 23, 2018
@Shreeshrii
Contributor Author

Shreeshrii commented Apr 23, 2018

@amitdo

amitdo commented Apr 23, 2018

Did you unpack jav from best/fast?

@topherseance

Hello, thanks a lot for your help. I appreciate it.
Thanks again for providing the links to Javanese script resources.

Sorry, I always thought that we needed images as training data, but that is not the case for Tesseract 4.0.
:)

Another question: do we have to collect all 500,000 text lines before beginning the training?
Can I, let's say, collect only 100 lines and then start the training?
(I'm also well aware that the result may not be good, e.g. overfitting.)

@Shreeshrii
Contributor Author

100 lines will work only for fine-tuning, but you can give it a try to get familiar with the training process.

@Shreeshrii
Contributor Author

Did you unpack jav from best/fast?

@amitdo I had only looked at langdata; I checked just now after your post. The unicharset in both is in Latin script only. See below for the tessdata_fast version.
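
(For anyone who wants to inspect a traineddata themselves, its components can be listed and unpacked with combine_tessdata; illustrative commands below, not necessarily how this dump was produced.)

combine_tessdata -d jav.traineddata          # list the components packed in the file
combine_tessdata -u jav.traineddata jav.     # unpack all components to jav.* files
# for an LSTM traineddata the character set ends up in jav.lstm-unicharset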

94
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined	# Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1	# Broken
J 5 0,255,0,255,0,0,0,0,0,0 Latin 73 0 3 J	# J [4a ]A
E 5 0,255,0,255,0,0,0,0,0,0 Latin 58 0 4 E	# E [45 ]A
N 5 0,255,0,255,0,0,0,0,0,0 Latin 61 0 5 N	# N [4e ]A
I 5 0,255,0,255,0,0,0,0,0,0 Latin 66 0 6 I	# I [49 ]A
F 5 0,255,0,255,0,0,0,0,0,0 Latin 84 0 7 F	# F [46 ]A
R 5 0,255,0,255,0,0,0,0,0,0 Latin 74 0 8 R	# R [52 ]A
M 5 0,255,0,255,0,0,0,0,0,0 Latin 63 0 9 M	# M [4d ]A
A 5 0,255,0,255,0,0,0,0,0,0 Latin 60 0 10 A	# A [41 ]A
G 5 0,255,0,255,0,0,0,0,0,0 Latin 65 0 11 G	# G [47 ]A
: 10 0,255,0,255,0,0,0,0,0,0 Common 12 6 12 :	# : [3a ]p
P 5 0,255,0,255,0,0,0,0,0,0 Latin 67 0 13 P	# P [50 ]A
L 5 0,255,0,255,0,0,0,0,0,0 Latin 72 0 14 L	# L [4c ]A
T 5 0,255,0,255,0,0,0,0,0,0 Latin 59 0 15 T	# T [54 ]A
U 5 0,255,0,255,0,0,0,0,0,0 Latin 68 0 16 U	# U [55 ]A
B 5 0,255,0,255,0,0,0,0,0,0 Latin 76 0 17 B	# B [42 ]A
, 10 0,255,0,255,0,0,0,0,0,0 Common 18 6 18 ,	# , [2c ]p
K 5 0,255,0,255,0,0,0,0,0,0 Latin 75 0 19 K	# K [4b ]A
H 5 0,255,0,255,0,0,0,0,0,0 Latin 62 0 20 H	# H [48 ]A
D 5 0,255,0,255,0,0,0,0,0,0 Latin 71 0 21 D	# D [44 ]A
S 5 0,255,0,255,0,0,0,0,0,0 Latin 64 0 22 S	# S [53 ]A
# 10 0,255,0,255,0,0,0,0,0,0 Common 23 4 23 #	# # [23 ]p
Ê 5 0,255,0,255,0,0,0,0,0,0 Latin 78 0 24 Ê	# Ê [ca ]A
- 10 0,255,0,255,0,0,0,0,0,0 Common 25 3 25 -	# - [2d ]p
. 10 0,255,0,255,0,0,0,0,0,0 Common 26 6 26 .	# . [2e ]p
Y 5 0,255,0,255,0,0,0,0,0,0 Latin 69 0 27 Y	# Y [59 ]A
W 5 0,255,0,255,0,0,0,0,0,0 Latin 70 0 28 W	# W [57 ]A
O 5 0,255,0,255,0,0,0,0,0,0 Latin 77 0 29 O	# O [4f ]A
' 10 0,255,0,255,0,0,0,0,0,0 Common 30 10 30 '	# ' [27 ]p
8 8 0,255,0,255,0,0,0,0,0,0 Common 31 2 31 8	# 8 [38 ]0
! 10 0,255,0,255,0,0,0,0,0,0 Common 32 10 32 !	# ! [21 ]p
” 10 0,255,0,255,0,0,0,0,0,0 Common 33 10 33 "	# ” [201d ]p
É 5 0,255,0,255,0,0,0,0,0,0 Latin 79 0 34 É	# É [c9 ]A
? 10 0,255,0,255,0,0,0,0,0,0 Common 35 10 35 ?	# ? [3f ]p
C 5 0,255,0,255,0,0,0,0,0,0 Latin 85 0 36 C	# C [43 ]A
È 5 0,255,0,255,0,0,0,0,0,0 Latin 80 0 37 È	# È [c8 ]A
2 8 0,255,0,255,0,0,0,0,0,0 Common 38 2 38 2	# 2 [32 ]0
; 10 0,255,0,255,0,0,0,0,0,0 Common 39 10 39 ;	# ; [3b ]p
/ 10 0,255,0,255,0,0,0,0,0,0 Common 40 6 40 /	# / [2f ]p
( 10 0,255,0,255,0,0,0,0,0,0 Common 41 10 43 (	# ( [28 ]p
" 10 0,255,0,255,0,0,0,0,0,0 Common 42 10 42 "	# " [22 ]p
) 10 0,255,0,255,0,0,0,0,0,0 Common 43 10 41 )	# ) [29 ]p
1 8 0,255,0,255,0,0,0,0,0,0 Common 44 2 44 1	# 1 [31 ]0
3 8 0,255,0,255,0,0,0,0,0,0 Common 45 2 45 3	# 3 [33 ]0
7 8 0,255,0,255,0,0,0,0,0,0 Common 46 2 46 7	# 7 [37 ]0
“ 10 0,255,0,255,0,0,0,0,0,0 Common 47 10 47 "	# “ [201c ]p
Z 5 0,255,0,255,0,0,0,0,0,0 Latin 87 0 48 Z	# Z [5a ]A
[ 10 0,255,0,255,0,0,0,0,0,0 Common 49 10 50 [	# [ [5b ]p
] 10 0,255,0,255,0,0,0,0,0,0 Common 50 10 49 ]	# ] [5d ]p
| 0 0,255,0,255,0,0,0,0,0,0 Common 51 10 51 |	# | [7c ]
V 5 0,255,0,255,0,0,0,0,0,0 Latin 86 0 52 V	# V [56 ]A
0 8 0,255,0,255,0,0,0,0,0,0 Common 53 2 53 0	# 0 [30 ]0
5 8 0,255,0,255,0,0,0,0,0,0 Common 54 2 54 5	# 5 [35 ]0
— 10 0,255,0,255,0,0,0,0,0,0 Common 55 10 55 -	# — [2014 ]p
_ 10 0,255,0,255,0,0,0,0,0,0 Common 56 10 56 _	# _ [5f ]p
€ 0 0,255,0,255,0,0,0,0,0,0 Common 57 4 57 €	# € [20ac ]
e 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 58 e	# e [65 ]a
t 3 0,255,0,255,0,0,0,0,0,0 Latin 15 0 59 t	# t [74 ]a
a 3 0,255,0,255,0,0,0,0,0,0 Latin 10 0 60 a	# a [61 ]a
n 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 61 n	# n [6e ]a
h 3 0,255,0,255,0,0,0,0,0,0 Latin 20 0 62 h	# h [68 ]a
m 3 0,255,0,255,0,0,0,0,0,0 Latin 9 0 63 m	# m [6d ]a
s 3 0,255,0,255,0,0,0,0,0,0 Latin 22 0 64 s	# s [73 ]a
g 3 0,255,0,255,0,0,0,0,0,0 Latin 11 0 65 g	# g [67 ]a
i 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 66 i	# i [69 ]a
p 3 0,255,0,255,0,0,0,0,0,0 Latin 13 0 67 p	# p [70 ]a
u 3 0,255,0,255,0,0,0,0,0,0 Latin 16 0 68 u	# u [75 ]a
y 3 0,255,0,255,0,0,0,0,0,0 Latin 27 0 69 y	# y [79 ]a
w 3 0,255,0,255,0,0,0,0,0,0 Latin 28 0 70 w	# w [77 ]a
d 3 0,255,0,255,0,0,0,0,0,0 Latin 21 0 71 d	# d [64 ]a
l 3 0,255,0,255,0,0,0,0,0,0 Latin 14 0 72 l	# l [6c ]a
j 3 0,255,0,255,0,0,0,0,0,0 Latin 3 0 73 j	# j [6a ]a
r 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 74 r	# r [72 ]a
k 3 0,255,0,255,0,0,0,0,0,0 Latin 19 0 75 k	# k [6b ]a
b 3 0,255,0,255,0,0,0,0,0,0 Latin 17 0 76 b	# b [62 ]a
o 3 0,255,0,255,0,0,0,0,0,0 Latin 29 0 77 o	# o [6f ]a
ê 3 0,255,0,255,0,0,0,0,0,0 Latin 24 0 78 ê	# ê [ea ]a
é 3 0,255,0,255,0,0,0,0,0,0 Latin 34 0 79 é	# é [e9 ]a
è 3 0,255,0,255,0,0,0,0,0,0 Latin 37 0 80 è	# è [e8 ]a
4 8 0,255,0,255,0,0,0,0,0,0 Common 81 2 81 4	# 4 [34 ]0
6 8 0,255,0,255,0,0,0,0,0,0 Common 82 2 82 6	# 6 [36 ]0
9 8 0,255,0,255,0,0,0,0,0,0 Common 83 2 83 9	# 9 [39 ]0
f 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 84 f	# f [66 ]a
c 3 0,255,0,255,0,0,0,0,0,0 Latin 36 0 85 c	# c [63 ]a
v 3 0,255,0,255,0,0,0,0,0,0 Latin 52 0 86 v	# v [76 ]a
z 3 0,255,0,255,0,0,0,0,0,0 Latin 48 0 87 z	# z [7a ]a
= 0 0,255,0,255,0,0,0,0,0,0 Common 88 10 88 =	# = [3d ]
< 0 0,255,0,255,0,0,0,0,0,0 Common 89 10 90 <	# < [3c ]
> 0 0,255,0,255,0,0,0,0,0,0 Common 90 10 89 >	# > [3e ]
@ 10 0,255,0,255,0,0,0,0,0,0 Common 91 10 91 @	# @ [40 ]p
$ 0 0,255,0,255,0,0,0,0,0,0 Common 92 4 92 $	# $ [24 ]
£ 0 0,255,0,255,0,0,0,0,0,0 Common 93 4 93 £	# £ [a3 ]

@Shreeshrii
Contributor Author

Shreeshrii commented Apr 26, 2018

@topherseance Please see the attached zip file, which has a test training for Javanese including both Javanese and Latin script. It was only trained (by replacing a layer) up to about 7% accuracy on the small training data that I could gather.

Keep us updated on your progress with training.

jav-traineddatas.zip
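
(To check how this test model behaves on your own line images, lstmeval can score a traineddata against a list of .lstmf files; a sketch with placeholder file names:)

lstmeval --model jav_java.traineddata --eval_listfile jav_java.eval_files.txt
# if --model points at a training checkpoint instead, also pass --traineddata
# with the matching starter traineddata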

@robbyablaze

jav-traineddatas.zip

@Shreeshrii hi, I am quite interested in this post; could you give me the training data from this? I need to generate Javanese script training data compatible with Tesseract 3.04/3.05. I want to use that training data on an Android device, and I use tess-two, which is not yet compatible with Tesseract 4.

@Shreeshrii
Contributor Author

generate Javanese script training data compatible with Tesseract 3.04/3.05

The requirements for training data for Tesseract 3.0x are quite different from those for 4.0.0 LSTM training.

You can use jav-java text from the UDHR or Wikipedia, as linked in the posts above.

@topherseance

Hello, sorry for the hiatus; I had other tasks to do.

Only found 2 Javanese fonts so far:

  1. Noto Sans Javanese (by Google)
  2. Tuladha Jejeg (by R.S. Wihananto)

I tried to create a starter traineddata for Noto Sans Javanese using the command below, and it worked successfully:

topher@topher-ubuntu:~$ ~/tesseract/src/training/tesstrain.sh   --fonts_dir ~/tess-javanese/fonts   --lang jav   --linedata_only   --noextract_font_properties   --langdata_dir ~/tesseract/langdata   --tessdata_dir ~/tesseract/tessdata   --fontlist "Noto Sans Javanese"   --output_dir ~/tess-javanese/jav01-train

=== Starting training for language 'jav'
[Sen Jul 9 14:37:14 WIB 2018] /usr/local/bin/text2image --fonts_dir=/home/topher/tess-javanese/fonts --font=Noto Sans Javanese --outputbase=/tmp/font_tmp.l81FA3YVZ2/sample_text.txt --text=/tmp/font_tmp.l81FA3YVZ2/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.l81FA3YVZ2
Stripped 1 unrenderable words
Rendered page 0 to file /tmp/font_tmp.l81FA3YVZ2/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Noto Sans Javanese
[Sen Jul 9 14:37:16 WIB 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.l81FA3YVZ2 --fonts_dir=/home/topher/tess-javanese/fonts --strip_unrenderable_words --leading=32 --xsize 2560 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0 --max_pages=0 --font=Noto Sans Javanese --text=/home/topher/tesseract/langdata/jav/jav.training_text
Rendered page 0 to file /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset --norm_mode 1 /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.box
Extracting unicharset from box file /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.box
Word started with a combiner:0xa9b8
Normalization failed for string 'ꦸ'
Word started with a combiner:0xa9bc
Word started with a combiner:0xa981
Normalization failed for string 'ꦼꦁ'
Word started with a combiner:0xa9b8
Normalization failed for string 'ꦸ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦠ'
Word started with a combiner:0xa9bc
Normalization failed for string 'ꦼ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲ'
Word started with a combiner:0xa9b6
Word started with a combiner:0xa981
Normalization failed for string 'ꦶꦁ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Wrote unicharset file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset -O /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset -X /tmp/tmp.TYPQCfx2ed/jav/jav.xheights --script_dir=/home/topher/tesseract/langdata
Loaded unicharset of size 16 from file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/topher/tesseract/langdata/Javanese.unicharset
Warning: properties incomplete for index 3 = ꧋
Warning: properties incomplete for index 4 = ꦱ
Warning: properties incomplete for index 5 = ꦒ
Warning: properties incomplete for index 6 = ꦫ
Warning: properties incomplete for index 7 = ꦮ
Warning: properties incomplete for index 8 = ꦮꦺ
Warning: properties incomplete for index 9 = ꦤ
Warning: properties incomplete for index 10 = ꦏ
Warning: properties incomplete for index 11 = ꦥꦺ
Warning: properties incomplete for index 12 = ꦝ
Warning: properties incomplete for index 13 = ꦪ
Warning: properties incomplete for index 14 = ꦗ
Warning: properties incomplete for index 15 = ꧉
Writing unicharset to file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/topher/tesseract/tessdata
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/tesseract /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.tif /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.2-342-g12f4 with Leptonica
Page 1

=== Constructing LSTM training data ===
[Sen Jul 9 14:37:17 WIB 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset --script_dir /home/topher/tesseract/langdata --words /home/topher/tesseract/langdata/jav/jav.wordlist --numbers /home/topher/tesseract/langdata/jav/jav.numbers --puncs /home/topher/tesseract/langdata/jav/jav.punc --output_dir /home/topher/tess-javanese/jav01-train --lang jav
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.wordlist
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.punc
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.numbers
Loaded unicharset of size 16 from file /tmp/tmp.TYPQCfx2ed/jav/jav.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/topher/tesseract/langdata/Javanese.unicharset
Warning: properties incomplete for index 3 = ꧋
Warning: properties incomplete for index 4 = ꦱ
Warning: properties incomplete for index 5 = ꦒ
Warning: properties incomplete for index 6 = ꦫ
Warning: properties incomplete for index 7 = ꦮ
Warning: properties incomplete for index 8 = ꦮꦺ
Warning: properties incomplete for index 9 = ꦤ
Warning: properties incomplete for index 10 = ꦏ
Warning: properties incomplete for index 11 = ꦥꦺ
Warning: properties incomplete for index 12 = ꦝ
Warning: properties incomplete for index 13 = ꦪ
Warning: properties incomplete for index 14 = ꦗ
Warning: properties incomplete for index 15 = ꧉
Config file is optional, continuing...
Failed to read data from: /home/topher/tesseract/langdata/jav/jav.config
Null char=2
Moving /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.box to /home/topher/tess-javanese/jav01-train
Moving /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.tif to /home/topher/tess-javanese/jav01-train
Moving /tmp/tmp.TYPQCfx2ed/jav/jav.Noto_Sans_Javanese.exp0.lstmf to /home/topher/tess-javanese/jav01-train

Created starter traineddata for language 'jav'


Run lstmtraining to do the LSTM training for language 'jav'


But when I tried to do the same for the Tuladha Jejeg font, it showed this error:

topher@topher-ubuntu:~$ ~/tesseract/src/training/tesstrain.sh   --fonts_dir ~/tess-javanese/fonts   --lang jav   --linedata_only   --noextract_font_properties   --langdata_dir ~/tesseract/langdata   --tessdata_dir ~/tesseract/tessdata   --fontlist "Tuladha Jejeg"   --output_dir ~/tess-javanese/jav02-train

=== Starting training for language 'jav'
[Sen Jul 9 14:45:14 WIB 2018] /usr/local/bin/text2image --fonts_dir=/home/topher/tess-javanese/fonts --font=Tuladha Jejeg --outputbase=/tmp/font_tmp.uAxequREYg/sample_text.txt --text=/tmp/font_tmp.uAxequREYg/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.uAxequREYg
Rendered page 0 to file /tmp/font_tmp.uAxequREYg/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Tuladha Jejeg
[Sen Jul 9 14:45:16 WIB 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.uAxequREYg --fonts_dir=/home/topher/tess-javanese/fonts --strip_unrenderable_words --leading=32 --xsize 2560 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0 --max_pages=0 --font=Tuladha Jejeg --text=/home/topher/tesseract/langdata/jav/jav.training_text
Rendered page 0 to file /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sen Jul 9 14:45:17 WIB 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset --norm_mode 1 /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.box
Extracting unicharset from box file /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.box
Word started with a combiner:0xa9bc
Word started with a combiner:0xa981
Normalization failed for string 'ꦼꦁ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦠ'
Word started with a combiner:0xa9bc
Normalization failed for string 'ꦼ'
Word started with a combiner:0xa9c0
Normalization failed for string '꧀ꦲ'
Word started with a combiner:0xa9b6
Word started with a combiner:0xa981
Normalization failed for string 'ꦶꦁ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Word started with a combiner:0xa983
Normalization failed for string 'ꦃ'
Word started with a combiner:0xa9b6
Normalization failed for string 'ꦶ'
Wrote unicharset file /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset
[Sen Jul 9 14:45:17 WIB 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset -O /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset -X /tmp/tmp.k4Fb5CaR5k/jav/jav.xheights --script_dir=/home/topher/tesseract/langdata
Loaded unicharset of size 17 from file /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:/home/topher/tesseract/langdata/Javanese.unicharset
Warning: properties incomplete for index 3 = ꧋
Warning: properties incomplete for index 4 = ꦱꦸ
Warning: properties incomplete for index 5 = ꦒ
Warning: properties incomplete for index 6 = ꦫ
Warning: properties incomplete for index 7 = ꦮꦸ
Warning: properties incomplete for index 8 = ꦮꦺ
Warning: properties incomplete for index 9 = ꦤ
Warning: properties incomplete for index 10 = ꦮ
Warning: properties incomplete for index 11 = ꦏ
Warning: properties incomplete for index 12 = ꦥꦺ
Warning: properties incomplete for index 13 = ꦝ
Warning: properties incomplete for index 14 = ꦪ
Warning: properties incomplete for index 15 = ꦗ
Warning: properties incomplete for index 16 = ꧉
Writing unicharset to file /tmp/tmp.k4Fb5CaR5k/jav/jav.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/home/topher/tesseract/tessdata
[Sen Jul 9 14:45:17 WIB 2018] /usr/local/bin/tesseract /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.tif /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.2-342-g12f4 with Leptonica
Page 1
Empty page!!
Empty page!!
ERROR: /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.lstmf does not exist or is not readable


The tesseract/langdata/jav directory contains only one file, jav.training_text, whose contents are one line of Javanese text:

꧋ꦱꦸꦒꦼꦁꦫꦮꦸꦃꦮꦺꦤ꧀ꦠꦼꦤ꧀ꦲꦶꦁꦮꦶꦏꦶꦥꦺꦝꦶꦪꦃꦗꦮꦶ꧉

(taken from here)

I opened the /tmp/ folder, looked at /tmp/tmp.k4Fb5CaR5k/jav/jav.Tuladha_Jejeg.exp0.tif, and I think it is rendered correctly. Here's the file: jav.Tuladha_Jejeg.exp0.zip
Did I do everything right? Sorry if it was a rookie mistake.

Another piece of info: Javanese script, per the Unicode standard, has glyph-combining letters (see Pasangan).
Tuladha Jejeg uses SIL Graphite to do the combining, whereas Noto Sans Javanese uses OpenType ligatures and anchors.
(I think OpenType has wider compatibility and support than SIL Graphite; for example, the Chrome browser doesn't support SIL Graphite, so Javanese script won't render correctly in that font.)
Does this have anything to do with the error?

Thanks in advance

@Shreeshrii
Contributor Author

text2image uses Pango for font rendering. It is possible that it does not support SIL Graphite fonts. I also get errors for the Annapurna SIL Devanagari font and do not use it.
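
One quick check is to ask text2image which fonts Pango can actually see in the fonts directory (the directory path below is just an example):

text2image --fonts_dir ~/tess-javanese/fonts --list_available_fonts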

@Shreeshrii
Contributor Author

I think I had used a couple more fonts.

@topherseance

I see.. but then why does the resulting .tif image seem to be rendered correctly?
It is the same when you compare it with the image from the Wikipedia link I provided.

@topherseance

We plan on using the OCR for old textbook scans written in Javanese script.
So far, the Tuladha Jejeg font is the most similar to the ones found in old textbooks.
Noto Sans Javanese looks a bit more 'modern'.

@topherseance

Just tested with other text strings; some of them worked, some did not. Here's what we found (a quick way to reproduce each case is sketched below):
ꦲꦤꦕꦫꦏ simple phrase, no glyph-combining --> success
ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼ uses glyph-combining: Pasangan --> success
ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼꦮꦸ uses glyph-combining: Sandhangan --> failure
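
(Each case above can be reproduced outside tesstrain.sh by rendering a single test string with text2image and inspecting the output; the file names below are placeholders:)

echo 'ꦲꦤꦕꦫꦏꦮꦺꦤ꧀ꦠꦼꦮꦸ' > sandhangan_test.txt
text2image --font='Tuladha Jejeg' --fonts_dir ~/tess-javanese/fonts \
  --text sandhangan_test.txt --outputbase sandhangan_test
# then look at sandhangan_test.tif and sandhangan_test.box to see whether the
# sandhangan was rendered and boxed correctly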

@Shreeshrii
Contributor Author

Shreeshrii commented Jul 9, 2018 via email

@Shreeshrii
Contributor Author

I ignored the errors and continued with training using 5 fonts which seem to cover the Javanese code range.

Iteration 29986: ALIGNED TRUTH : ꦤ꧀ ꦄꦩꦺꦫꦶꦏ ꦒꦝꦃ ꦠꦁꦒꦺꦭ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦥꦺꦏꦭꦺꦴꦔꦤ꧀ ꦧꦚ꧀ꦗꦸꦂ ꦭ
Iteration 29986: BEST OCR TEXT : ꦤ꧀ ꦄꦩꦺꦫꦶꦏ ꦒꦝꦃ ꦠꦁꦒꦺꦭ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦥꦺꦏꦭꦺꦴꦔꦤ꧀ ꦧꦚ꧀ꦗꦂ ꦭ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Noto_Sans_Javanese.exp2.lstmf page 100 :
Mean rms=1.605%, delta=4.39%, train=13.896%(52.571%), skip ratio=0.4%
Iteration 29987: ALIGNED TRUTH : ꦶꦠꦸꦠ꧀ ꦏꦺꦱꦺꦤꦶꦪꦤ꧀ ꦱꦩ꧀ꦥꦸꦤ꧀ ꦗꦁꦏꦺꦥ꧀ ꦫꦺꦏꦺꦴꦂ ꦥꦶꦪ ꦲꦺꦩ꧀ꦥ꧀ꦭꦺꦴꦏ꧀ ꦲꦺꦩ꧀ꦧꦃꦏꦏꦸꦁ ꦲꦺꦩ꧀ꦧꦃꦥꦸꦠꦿ
Iteration 29987: BEST OCR TEXT : ꦱꦶꦠꦸꦠ꧀ ꦏꦺꦱꦺꦤꦶꦪꦤ꧀ ꦱꦩ꧀ꦥꦸꦤ꧀ ꦗꦁꦏꦺꦥ꧀ ꦫꦺꦏꦺꦂ ꦥꦶꦪ ꦲꦺꦩꦺꦏ꧀ ꦲꦺꦩ꧀ꦧꦃꦏꦏꦸꦁ ꦲꦺꦩ꧀ꦧꦃ ꦥꦸꦠꦶ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp0.lstmf page 31 :
Mean rms=1.605%, delta=4.392%, train=13.901%(52.578%), skip ratio=0.4%
Iteration 29988: ALIGNED TRUTH : ꦤꦺꦴꦧꦺꦭ꧀ ꦱꦱ꧀ꦠꦿ ꦤꦺꦴꦧꦺꦭ꧀ ꦱ
Iteration 29988: BEST OCR TEXT : ꦏꦤꦺꦴꦧꦺꦭ꧀ ꦱꦱꦿ ꦤꦺꦴꦠꦺꦭ꧀ ꧈
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp-1.lstmf page 499 :
Mean rms=1.605%, delta=4.392%, train=13.884%(52.578%), skip ratio=0.4%
Iteration 29989: ALIGNED TRUTH : ꦼꦱꦶꦲꦗꦶ ꦮꦼꦱꦶ ꦮꦼꦱ꧀ꦠ ꦮꦼꦱ꧀ꦥꦢ
Iteration 29989: BEST OCR TEXT : ꦼꦱꦶꦲꦗꦼ ꦮꦼꦱꦶ ꦮꦼꦱ꧀ꦠ ꦥꦼꦱ꧀ꦥꦢ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp1.lstmf page 123 :
Mean rms=1.605%, delta=4.386%, train=13.875%(52.559%), skip ratio=0.4%
Iteration 29990: ALIGNED TRUTH : ꦱ
Iteration 29990: BEST OCR TEXT : ꦕꦱ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp-2.lstmf page 362 :
Mean rms=1.606%, delta=4.395%, train=13.961%(52.617%), skip ratio=0.4%
Iteration 29991: ALIGNED TRUTH : ꦩꦪꦸꦫ ꦩꦫꦁ ꦩꦫꦏꦂꦩ ꦩꦫꦏꦠ
Iteration 29991: BEST OCR TEXT : ꦩꦥꦪꦸꦫ ꦩꦫꦁ ꦩꦫꦏꦂꦩ ꦩꦫꦏꦠ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Tuladha_Jejeg.exp2.lstmf page 685 (Perfect):
Mean rms=1.605%, delta=4.391%, train=13.966%(52.592%), skip ratio=0.4%
Iteration 29992: ALIGNED TRUTH : ꦩꦺꦤ꧀ꦠ꧀ ꦩꦫꦺꦠ꧀ ꦠꦻꦴꦤ꧀ ꦝꦺꦮꦺꦏꦺ ꦲꦺꦤ꧀ꦠꦸꦏ꧀ ꦮꦶꦒꦠꦶ ꦱꦔꦺꦠ꧀ ꦱꦠꦸꦁ ꦒꦭ
Iteration 29992: BEST OCR TEXT : ꦩꦺꦤ꧀ꦠ꧀ꦩꦫꦺꦠ꧀ ꦠꦻꦴꦤ꧀ ꦝꦺꦮꦺꦏꦺꦲꦺꦤ꧀ꦠꦸꦏ꧀ꦮꦶꦒꦠꦶꦱ ꦔꦺꦠ꧀ ꦱꦠꦸꦁ ꦒꦭ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp0.lstmf page 604 :
Mean rms=1.605%, delta=4.391%, train=13.965%(52.612%), skip ratio=0.4%
Iteration 29993: ALIGNED TRUTH : ꦤ꧀ ꦱꦶꦗꦶ ꦏꦧꦸꦥꦠꦺꦤ꧀ ꦠꦥꦤꦸꦭꦶ ꦧꦒꦺꦪꦤ꧀ ꦱꦏꦶꦁ ꦆꦧꦸꦏꦸꦛ ꦏꦺꦕꦩ
Iteration 29993: BEST OCR TEXT : ꦤ꧀ ꦱꦶꦗꦶꦏꦧꦸꦥꦠꦺꦤ꧀ ꦠꦥꦤꦸꦭꦶꦧꦒꦺꦪꦤ꧀ ꦱꦏꦶꦁ ꦆꦧꦸꦤ꧀ꦛ ꦏꦺꦕꦩ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp-1.lstmf page 162 :
Mean rms=1.605%, delta=4.39%, train=13.965%(52.591%), skip ratio=0.4%
Iteration 29994: ALIGNED TRUTH : ꦱꦮꦶꦱꦺ ꦏꦧꦸꦥꦠꦺꦤ꧀ ꦕꦶꦪꦚ꧀ꦗꦸꦂ ꦱꦏ ꦠꦻꦴꦤ꧀ ꦥꦿꦺꦴꦮ꦳ꦶꦤ꧀ꦱꦶ ꦭꦶꦩ꧀ꦧꦸꦂꦒ꧀ ꦢꦶꦮ ꦩꦺꦤꦺꦲꦶ
Iteration 29994: BEST OCR TEXT : ꦱꦮꦶꦱꦺꦏꦧꦸꦥꦠꦺꦤ꧀ ꦕꦶꦪꦚ꧀ꦗꦸꦂ ꦱꦏ ꦠꦻꦴꦤ꧀ ꦥꦿꦺꦴꦮ꦳ꦶꦤ꧀ꦱꦶ ꦭꦶꦩ꧀ ꦧꦂꦒ꧀ ꦢꦶꦮ ꦩꦺꦤꦺꦲꦶ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp1.lstmf page 759 :
Mean rms=1.604%, delta=4.386%, train=13.948%(52.549%), skip ratio=0.4%
Iteration 29995: ALIGNED TRUTH : ꦝꦺ ꦝꦠꦺꦁ ꦏꦱꦸꦭ꧀ꦠꦤꦤ꧀ ꦢꦺꦩꦏ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦏꦭꦶꦠꦶꦢꦸ ꦥꦂꦠꦻ ꦒꦺꦴꦭ꧀ꦏꦂ
Iteration 29995: BEST OCR TEXT : ꦝꦺꦝꦠꦺꦴꦁ ꦏꦱꦸꦭ꧀ꦠꦤꦤ꧀ ꦢꦺꦩꦏ꧀ ꦏꦺꦕꦩꦠꦤ꧀ ꦏꦭꦶꦠꦶꦢꦸꦥꦂ ꦠꦻ ꦒꦺꦴꦭ꧀ ꦏꦂ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp-2.lstmf page 160 :
Mean rms=1.603%, delta=4.37%, train=13.868%(52.512%), skip ratio=0.4%
Iteration 29996: ALIGNED TRUTH : ., ꦭꦶꦩ꧀ꦥꦢ꧀ ꦭꦶꦩ꧀ꦥꦸꦁ ꦭꦶꦁꦒꦶꦃ ꦭꦶꦁꦒ ꦭꦶꦁꦱꦁ ꦭꦶꦁꦱꦶꦂ ꦭꦶꦁꦱꦼꦩ꧀
Iteration 29996: BEST OCR TEXT : . , ꦭꦶꦩ꧀ꦥꦢ꧀ ꦭꦶꦩ꧀ꦥꦸꦁ ꦭꦶꦁꦒꦶꦃ ꦭꦶꦁꦒ ꦭꦶꦁꦱꦁ ꦭꦶꦁꦱꦶꦂ ꦭꦶꦱꦼꦩ꧀
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan_Anyar.exp2.lstmf page 183 :
Mean rms=1.603%, delta=4.372%, train=13.869%(52.514%), skip ratio=0.4%
Iteration 29997: ALIGNED TRUTH : ꦥꦿꦏꦫꦤ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦮꦶꦢꦸꦫ
Iteration 29997: BEST OCR TEXT : ꦥꦿꦏꦫꦤ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦥꦿꦏꦫ ꦥꦿꦏꦮꦶꦱ꧀ ꦥꦿꦏꦱꦶꦠ ꦥꦿꦏꦱ ꦮꦶꦢꦸꦫ
File ./jav_java-layer_train/jav_java.Carakan-Unicode.exp0.lstmf page 321 (Perfect):
Mean rms=1.603%, delta=4.372%, train=13.869%(52.514%), skip ratio=0.4%
Iteration 29998: ALIGNED TRUTH : ꦤꦭꦶꦏ ꦏꦸꦮꦶ ꦒꦩ꧀ꦧꦂ:ꦥ꦳꧀ꦭꦒ꧀ ꦲꦺꦴꦥ꦳꧀ ꦲꦶꦁꦏꦁ ꦏꦺꦢꦃ ꦢꦤ꧀ ꦱꦺ
Iteration 29998: BEST OCR TEXT : ꦤꦭꦶꦏ ꦏꦸꦮꦶꦒꦩ꧀ꦧꦂꦥ꦳꧀ ꦭꦒ꧀ ꦲꦺꦴꦥ꦳꧀ ꦲꦶꦁꦏꦁ ꦏꦺꦢꦃ ꦢꦤ꧀ ꦱꦺ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan-Unicode.exp-1.lstmf page 3 :
Mean rms=1.602%, delta=4.368%, train=13.857%(52.475%), skip ratio=0.4%
Iteration 29999: ALIGNED TRUTH : ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶꦃ ꦲꦱꦶꦤ꧀ ꦲꦱꦶꦫꦤ꧀ ꦲꦱꦭ꧀ ꦲꦸꦱꦸꦭ꧀ - ꦲꦱꦭ꧀ ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶ
Iteration 29999: BEST OCR TEXT : ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶꦃ ꦲꦱꦶꦤꦏ꧀ ꦲꦱꦶꦫꦤ꧀ ꦲꦱꦭ꧀ꦲꦸꦱꦸꦁꦭ꧀ - ꦲꦱꦭ꧀ ꦲꦱꦱ꧀ꦠ ꦲꦱꦶꦂ ꦲꦱꦶ
File /tmp/tmp.hW6IvPk7gK/jav_java/jav_java.Carakan-Unicode.exp1.lstmf page 71 :
Mean rms=1.602%, delta=4.368%, train=13.852%(52.464%), skip ratio=0.4%
At iteration 28598/30000/30058, Mean rms=1.602%, delta=4.368%, char train=13.852%, word train=52.464%, skip ratio=0.4%,  New worst char error = 13.852 wrote checkpoint.

Finished! Error rate = 13.384
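
(Once a run like this finishes, the best checkpoint is typically frozen into a final traineddata roughly as follows; paths are illustrative:)

lstmtraining --stop_training \
  --continue_from output/jav_java_checkpoint \
  --traineddata jav_java/jav_java.traineddata \
  --model_output jav_java.traineddata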

@topherseance

topherseance commented Jul 18, 2018

Can you please share the commands and steps you used for the above training?

I still can't get the training to work successfully. I used the "training from scratch" method. Again, sorry if it is a newbie mistake.
Also, I couldn't find a .unicharset file in langdata/jav; do I need to create one?
The resulting log contains many occurrences of this:

Encoding of string failed! Failure bytes: ffffffea ffffffa6 ffffff83 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa7 ffffff88 20 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffba ffffffea ffffffa6 ffffffb4 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffff8f ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa7 ffffff89 20 ffffffea ffffffa6 ffffff8f ffffffea ffffffa6 ffffffbc ffffffea ffffffa6 ffffffa4 ffffffea ffffffa6 ffffffad ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffad ffffffea ffffffa6 ffffffa4 ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffffa5 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffa4 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffff8f ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffff8f ffffffea ffffffa7 ffffff80 ffffffea ffffffa6 ffffff8f ffffffea ffffffa6 ffffffb1 ffffffea ffffffa6 ffffffbc ffffffea ffffffa6 ffffffa7 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa6 ffffffa0 ffffffea ffffffa7 ffffff80 ffffffea ffffffa7 ffffff88
Can't encode transcription: 'ꦏꦧꦺꦃꦩꦲꦸ꧈ ꦲꦶꦪꦺꦴꦲꦏ꧀ꦩꦸ꧉ ꦏꦼꦤꦭꦶꦭꦤ꧀ꦥꦲꦩꦤꦲꦏ꧀ꦲꦏ꧀ꦏꦱꦼꦧꦸꦠ꧀꧈' in language ''

I did run unicharset_extractor with a .txt file containing Javanese text. Here's the resulting unicharset file:

jav.unicharset.txt

Each line contains 0,255,0,255,0,0,0,0,0,0; I guess it is some sort of coordinates. Is that the correct value?
Or maybe I should use it anyway? The unicharset file you copy-pasted earlier in this thread also contains 0,255,0,255,0,0,0,0,0,0 on each line.

@topherseance

Found another font: Prada

@Shreeshrii
Contributor Author

Shreeshrii commented Jul 19, 2018 via email

@amitdo

amitdo commented Jul 19, 2018

Encoding of string failed! Failure bytes: ffffffea ffffffa6 ffffff83 ffffffea ffffffa6 ffffffa9 ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa7 ffffff88

The text is clearly not encoded in utf-8.

@Shreeshrii
Contributor Author

Can you please share the commands and steps you did for the above training?

Please see https://github.com/Shreeshrii/tessdata_jav_java

@topherseance

I collected a few Javanese aksara texts here; it probably has several thousand text lines:
https://github.com/topherseance/bible_javanese_aksara

@topherseance

@Shreeshrii when you ran your scripts, layertrain.sh or plustrain.sh, did you get the "Encoding of string failed" error?

I ran the script and still got this:

File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Carakan_Anyar.exp1.lstmf page 569 :
Mean rms=5.024%, delta=42.402%, train=100.11%(100%), skip ratio=61.7%
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8b ffffffea ffffffa6 ffffffb1 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffff81 20 ffffffea ffffffa6 ffffffa2 ffffffea ffffffa6 ffffffb6 ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffaa ffffffea ffffffa6 ffffffb2 ffffffea ffffffa6 ffffffb6 20 ffffffea ffffffa6 ffffffa2 ffffffea ffffffa6 ffffffb8 ffffffea ffffffa6 ffffffa9
Can't encode transcription: 'ꦮꦺꦠꦤ꧀​ꦱꦶꦁ ꦢꦶꦪꦪꦲꦶ ꦢꦸꦩ' in language ''
Iteration 1171: ALIGNED TRUTH : ꦩꦭꦁꦲꦠꦺꦤꦶ ꦧꦸꦩꦶ ꦧꦸꦩ꧀ꦥꦼꦠ꧀
Iteration 1171: BEST OCR TEXT : 
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Tuladha_Jejeg.exp0.lstmf page 480 :
Mean rms=5.024%, delta=42.397%, train=100.111%(100%), skip ratio=61.7%
Iteration 1172: ALIGNED TRUTH : ꦄꦢꦩ꧀ ꦩꦭꦶꦏ꧀ ꦮ꦳ꦶꦢꦺꦪꦺꦴ ꦒ
Iteration 1172: BEST OCR TEXT : 
File /tmp/tmp.fAmoYPWBIL/jav_java/jav_java.Carakan_Anyar.exp-1.lstmf page 40 :
Mean rms=5.023%, delta=42.37%, train=100.113%(100%), skip ratio=61.7%
Iteration 1173: ALIGNED TRUTH : ꦏꦤ꧀ꦕ꧀ꦂꦶꦠ꧀ ꦏꦤ꧀ꦛꦶꦁ
Iteration 1173: BEST OCR TEXT : 

I checked the encoding of jav.training_text; I guess it is encoded in UTF-8:

topher@topher-ubuntu:~/tesseract/langdata/jav$ file -i jav.training_text
jav.training_text: text/plain; charset=utf-8
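
Beyond file -i, two rough extra checks (assuming iconv and ICU's uconv are available) are whether the file survives a UTF-8 round trip and whether it is already NFC-normalized. It may also be worth looking for invisible characters: the failure bytes above begin with e2 80 8b, which is the UTF-8 encoding of U+200B ZERO WIDTH SPACE.

iconv -f UTF-8 -t UTF-8 jav.training_text > /dev/null && echo "valid UTF-8"
uconv -f UTF-8 -t UTF-8 -x any-nfc jav.training_text | cmp -s - jav.training_text \
  && echo "already NFC" || echo "differs from its NFC form"
grep -nP '\x{200B}' jav.training_text    # lines with zero-width spaces (needs GNU grep with PCRE)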

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 6, 2018 via email

@topherseance

Just some lines, I guess.
My locale is EN.

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 7, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 7, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 7, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 9, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 11, 2018 via email

@topherseance

Done converting this file to aksara jawa:
https://github.com/tesseract-ocr/langdata_lstm/blob/master/jav/jav.training_text
Result:
https://github.com/topherseance/javanese-aksara-training-text

But what about the other files?
For example, .numbers, .wordlist.
Is the .numbers file correct? It seems to contain random letters.

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 14, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 14, 2018 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Aug 15, 2018 via email

@gindrawan

Hi, first I'm sorry, perhaps this is a different topic, but I think it is quite related. I've opened an issue at #152 (Balinese script OCR) but was still confused (newbie syndrome) until I finally landed here.

I'm ready to collect training text but am still on hold: as with the Javanese fonts, Balinese script has the Bali Simbar Dwijendra font (see the posted issue), which is most similar to the ancient script but not yet tested for training (I'm afraid of the same incompatibility issue as Tuladha Jejeg; will check soon). On the other hand, Balinese script also has Noto Sans/Serif Balinese from Google.

Also, I've downloaded https://github.com/Shreeshrii/tessdata_jav_java, and its README.md says "Source code changes will be needed in tesseract... "

Could you direct me on how to use all the material here, since Javanese script has had a big influence on Balinese script? Geographically, Bali and Java are also neighbors.

Thank you very much in advance for your kind attention.

@Shreeshrii
Contributor Author

I had done aksara jawa training and created two traineddata files - see the links given in https://github.com/Shreeshrii/tessdata_jav_java/blob/master/README.md
But I am not sure how accurate those are or whether @topherseance did further training on them.

The changes to tesseract codebase were made via:

tesseract-ocr/tesseract@0eb7be1#diff-eaafd22a79065f5b8d28318d482e650d

tesseract-ocr/tesseract@7957288#diff-eaafd22a79065f5b8d28318d482e650d

tesseract-ocr/tesseract@b34cf9d#diff-eaafd22a79065f5b8d28318d482e650d

@gindrawan

Thanks for the quick response @Shreeshrii

Here is the updated situation:

  1. In the attachment, we have 2 fonts with Balinese Unicode support, namely Vimala (the most similar to the non-Unicode Bali Simbar Dwijendra) and Noto Sans Balinese (analogous to Javanese, which has Noto Sans Javanese).
  2. I want to use https://github.com/Shreeshrii/tessdata_jav_java as a base for training with my Balinese training text. See the attachment for the Balinese version of Article 1 of the Universal Declaration of Human Rights (https://en.wikipedia.org/wiki/Balinese_script). And about the three-letter code for that text, I don't know: jav for Javanese, bal for Balinese?

The question is, how do I do that? I've spent several hours trying to learn and work out the strategy, but I'm still far away.

bal.training_text.txt

balinese-unicode.zip

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 23, 2020 via email

@gindrawan

UDHR is a small text. You will need larger text for training. LSTM training takes time, days and weeks.


Yes, I know that. I want to start with a small training text first and incrementally add more later (if possible) while gaining a better understanding of the training process. I already have a larger training text in Noto Sans Balinese (up to 30 thousand words, and possibly double that for Vimala). Most likely the number will continue to grow, since there are other sources that haven't been processed yet. I don't know if that number is enough.

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 23, 2020 via email

@Shreeshrii
Contributor Author

Shreeshrii commented Mar 23, 2020 via email

@gindrawan

Ok, thanks. I'll post the update at #152.

@bennylin

bennylin commented Sep 5, 2020

@topherseance: if you're still looking for the Javanese OCR, a team in UKDW is working on it.

@bennylin

bennylin commented Sep 5, 2020

@Shreeshrii & @topherseance: there are more than 20 Javanese script fonts available here:
https://bennylin.github.io/keyboards/jawa-fonts.html

@Shreeshrii
Contributor Author

@bennylin Are these Unicode fonts?

@bennylin

bennylin commented Sep 5, 2020

Yes

@Shreeshrii
Contributor Author

Are there any labelled datasets with scanned images and their Unicode ground-truth transcriptions that can be used for training/testing Tesseract's jav-java traineddata?

What accuracy did the UKDW OCR achieve?

@bennylin

bennylin commented Sep 5, 2020

I'm not in the loop for the research. You might want to contact Dr. Lucia Krisnawati for that.
