Can't load OSD data #33

Open
knowtheory opened this issue Mar 7, 2014 · 5 comments

@knowtheory

Hey meh,

Just looking for some quick advice. I've managed to get ruby-tesseract-ocr working with page_segmentation_mode 1 on Ubuntu (12.04) and the OSD trained data.

I'm having trouble doing the same on OS X (Mavericks), unfortunately. I've got tesseract installed via Homebrew, and although I can use the default tesseract CLI wrapper to extract text using the OSD data, I can't manage the same using ruby-tesseract-ocr. The tesseract CLI has a --list-langs option, which displays "osd" as one of the available languages.

Despite that, this keeps happening:

2.1.0 :010 >   tesseract = Tesseract::Engine.new{ |e| 
2.1.0 :011 >       e.language               = LANGUAGE
2.1.0 :012?>     e.page_segmentation_mode = 1
2.1.0 :013?>   }
 => #<Tesseract::Engine:0x00000101aed998 @api=#<Tesseract::API:0x00000101aed830 @internal=#<FFI::AutoPointer address=0x00000102dae230>>, @initializing=false, @init=#<Proc:0x00000101aed948@(irb):10>, @path=".", @language=:ukr, @mode=:DEFAULT, @variables={}, @config=[], @rectangle=[], @psm=1> 
2.1.0 :014 > blocks = tesseract.blocks_for(sideways)
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
 => [#<Tesseract::Block(61.34318161010742): "шзешшцш:\n\n">, #<Tesseract::Block(63.370262145996094): "шшёо\n\n">, #<Tesseract::Block(60.00260543823242): "...Е .пьшцс ю\xD1сюоцюм Еьозцоьао\xD1\n\n">, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(62.234920501708984): "„м8\n\nюоы .ЦЕнЦФЕ „лноіг\xD1ёоцгьь юшозаоьаои\nщит: .шЕ дозцоьао.. о: оьодош\nтої ...Еь .шьшцс юЕБоцшм штбзцоьао:\n\nтот ...чоьнчгп „льоіьъдёоцъць шшозцоьао:\n\nыююю ...гг . .дюоц \xD1:..==ш...:оЕ льоїашш мшозцоьао: оьолош\nтм. .ІЕ „\xD1шьтаьзш \xD1зтз\xD1юоцзшо\xD1дцшьшм ш шьшц: ш\xD1ьюоцшм\n\nю .цоькцец .\xD1юьшаьзш \xD1зтз\xD1юоцзщошдцшьшм ш льосЕдёоцёць\n\nїю .ІВ „зьтцьзш _т::юоц:шо_._д:ш._шю\n\nї З ь ч _ ю ы г\n\n">, #<Tesseract::Block(48.451499938964844): "ттылчоцлїцлёщ\n\n">, #<Tesseract::Block(49.25178146362305): "3.93 ю-у_ш< о\xD1шцсёо\xD1 :::ёшцьоцсш\n\n">, #<Tesseract::Block(0.0): nil>] 

Do you have any advice as to whether I'm missing a config setting somewhere? I'm mostly perplexed because, as far as I can tell, the data is in the right place and everything else works (no compile errors or anything either).
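
One way to double-check that the data really is where the Ruby process looks is to list the traineddata files under the prefix it would use. The sketch below uses only the Ruby standard library; the helper names `available_languages` and `osd_available?` are hypothetical, and it assumes the Tesseract 3.x layout in which the prefix is the parent of the `tessdata` directory. It does not reproduce how the gem or libtesseract actually resolves paths.

```ruby
# Diagnostic sketch: list the traineddata files visible under a given prefix.
# Assumption: Tesseract 3.x layout, where the prefix is the PARENT of tessdata/.

# Returns the sorted language codes found in <prefix>/tessdata.
def available_languages(prefix)
  pattern = File.join(prefix, "tessdata", "*.traineddata")
  Dir.glob(pattern).map { |path| File.basename(path, ".traineddata") }.sort
end

# True when osd.traineddata sits next to the other language files.
def osd_available?(prefix)
  available_languages(prefix).include?("osd")
end

if __FILE__ == $PROGRAM_NAME
  # Falls back to the current directory, the same value the Engine
  # inspect output above shows for @path.
  prefix = ENV.fetch("TESSDATA_PREFIX", ".")
  puts "prefix:    #{File.expand_path(prefix)}"
  puts "languages: #{available_languages(prefix).join(', ')}"
  puts "osd found: #{osd_available?(prefix)}"
end
```

If this reports the language in use but not "osd", the two files probably live in different directories, which might explain why ordinary OCR works while orientation and script detection fails to load.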

@meh
Owner

meh commented Mar 7, 2014

This is a duplicate of #23, but not having an OS X system prevents me from doing any debugging in regard to that.

Honestly, I think it's an issue with how tesseract-ocr is compiled on OS X, since, from what I recall, the library doesn't export anything to define load paths.

In short, the only advice I have is to look carefully at the configure options when building tesseract-ocr and hope for the best.

@knowtheory
Author

Alrighty, thanks @meh, I'll try to take a poke around Homebrew's tesseract recipe. The thing that I don't quite get is how the PSM settings look for osd.traineddata in a manner that's different from the main mechanism for loading language training data (since I'm OCRing non-English documents just fine).

@meh
Owner

meh commented Mar 7, 2014

If you look at the examples/nerdz-captcha-breaker/break.rb source, it doesn't do any path fiddling; it basically just looks for tessdata in the same directory the script is run from.

This means the load paths for language files are a compile-time option.

EDIT: wait, it actually does a Tesseract.prefix = './'. I guess that's what should be done if you have your language files in different directories from the standard ones.
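
A rough sketch of how that could look when the data is not next to the script: pick the first directory that actually contains the wanted traineddata file and hand it to the `Tesseract.prefix =` setter mentioned above. The candidate paths and the `find_prefix` helper are assumptions for illustration, not part of the gem, and the layout assumed is again the 3.x one (prefix is the parent of `tessdata`). The setter is only called if the gem happens to be loaded.

```ruby
# Candidate prefixes to try, in order. These paths are assumptions about
# typical installs; adjust them to match your own system.
CANDIDATE_PREFIXES = [
  ENV["TESSDATA_PREFIX"],       # explicit override, if set
  "/usr/local/share",           # assumed Homebrew location on OS X
  "/usr/share/tesseract-ocr",   # assumed Ubuntu package location
  "."                           # working directory, as in break.rb
].compact.freeze

# Returns the first prefix whose tessdata/ holds <language>.traineddata,
# with a trailing slash (mirroring the './' used in break.rb), or nil.
def find_prefix(language, candidates = CANDIDATE_PREFIXES)
  hit = candidates.find do |prefix|
    File.file?(File.join(prefix, "tessdata", "#{language}.traineddata"))
  end
  hit && File.join(hit, "")
end

prefix = find_prefix("osd")

# Only touch the gem if it has been required; the setter itself is the
# one quoted from break.rb in this thread.
if prefix && defined?(Tesseract) && Tesseract.respond_to?(:prefix=)
  Tesseract.prefix = prefix
end
```

This would need to run before the engine is created. Whether it resolves the OS X failure is untested here; it only makes sure the prefix points at a directory that really contains osd.traineddata.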

@bwinterling

@knowtheory curious whether you had any luck digging into the OS X related issues? It would be nice to play with the other Tesseract configs, like segmentation mode and custom configs, but I get the same errors mentioned above. Not sure I have the experience to help debug, but I'll probably give it a shot if I have time.

@knowtheory
Author

@bwinterling unfortunately, no, I haven't had time to dig in :(
