Make `TessdataManager` able to save archive using LibArchive #4187

sadra-barikbin · 2024-02-04T06:03:32Z

Made TessdataManager able to save an archive using LibArchive
Added -t option to combine_tessdata to transform a proprietary .traineddata file into an archive file.

sadra-barikbin · 2024-02-04T06:07:16Z

@stweil, shall I add a test?

stweil · 2024-02-07T20:57:37Z

src/ccutil/tessdatamanager.cpp

  ASSERT_HOST(is_loaded_);
  std::vector<char> data;
  Serialize(&data);
  if (writer == nullptr) {
+#if defined(HAVE_LIBARCHIVE)
+    return SaveArchiveFile(filename);


With this change TessdataManager::SaveFile will always write traineddata files in ZIP format, which is incompatible with Tesseract binaries that were built without LibArchive. I'm afraid that would cause problems for a lot of people.

I thought libarchive can deduce archive types.

Yes, it can, but the current code uses archive_write_set_format_zip instead of archive_write_set_format_by_name, so it will always write a ZIP file. And of course libarchive cannot write the proprietary traineddata format.

stweil · 2024-02-07T21:14:05Z

> Added -t option to combine_tessdata to transform proprietary .traineddata to archive file.

I'd prefer a more general solution which allows different target formats. In addition it should allow writing to a different output file and use long options. So the syntax might look like this:

Usage: combine_tessdata --convert [--format TARGET_FORMAT] INFILE [OUTFILE]

TARGET_FORMAT would default to zip, but should allow any typical file extension of archive files and also traineddata for conversions into the proprietary Tesseract format.
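
The proposed syntax could be parsed roughly as follows. This is only a sketch under stated assumptions: `ConvertOptions` and `ParseConvertArgs` are hypothetical names that do not exist in combine_tessdata, and the only behavior taken from the proposal is the argument order and the `zip` default.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result of parsing the proposed command line.
struct ConvertOptions {
  bool valid = false;
  std::string format = "zip";  // default suggested in the proposal
  std::string infile;
  std::string outfile;  // empty means that no OUTFILE was given
};

// Parses: --convert [--format TARGET_FORMAT] INFILE [OUTFILE]
// `args` holds the arguments after the program name.
ConvertOptions ParseConvertArgs(const std::vector<std::string> &args) {
  ConvertOptions opts;
  if (args.empty() || args[0] != "--convert") {
    return opts;
  }
  std::size_t i = 1;
  if (i < args.size() && args[i] == "--format") {
    if (i + 1 >= args.size()) {
      return opts;  // --format without a value
    }
    opts.format = args[i + 1];
    i += 2;
  }
  const std::size_t remaining = args.size() - i;
  if (remaining < 1 || remaining > 2) {
    return opts;  // INFILE is required, OUTFILE is optional
  }
  opts.infile = args[i];
  if (remaining == 2) {
    opts.outfile = args[i + 1];
  }
  opts.valid = true;
  return opts;
}
```

How a missing OUTFILE should be handled (overwrite INFILE, or derive a name from the target format) is left open here, as it is in the proposal.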

egorpugin · 2024-02-07T21:15:38Z

Isn't automatic format selection simpler?
Libarchive writes files based on their extension.
If we write .zip - it will be .zip.
.tar.gz - tar.gz
.tar.xz or lz - ...
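
To illustrate the idea of choosing the format from the file name, here is a minimal sketch that does not call libarchive at all. `GuessArchiveFormat` is a hypothetical helper, and the list of suffixes is an assumption made for the example, not a list of what libarchive supports.

```cpp
#include <string>

// Returns true when `name` ends with `suffix`.
static bool HasSuffix(const std::string &name, const std::string &suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical helper: maps a file name to a format label based on its
// suffix. Returns an empty string when the suffix is not recognized.
std::string GuessArchiveFormat(const std::string &filename) {
  // Assumed example suffixes; a real tool would take these from libarchive.
  static const char *const kSuffixes[] = {".tar.gz", ".tar.xz", ".zip",
                                          ".traineddata"};
  for (const char *suffix : kSuffixes) {
    if (HasSuffix(filename, suffix)) {
      return std::string(suffix + 1);  // drop the leading dot
    }
  }
  return std::string();
}
```

With this approach an explicit format option would only be needed when the output name does not reveal the intended format.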

stweil · 2024-02-08T06:45:57Z

That's right, and this feature of LibArchive would also be used to implement my suggested solution.

If we implement support for combine_tessdata --convert eng.traineddata eng.zip, that would create a filename which is currently unsupported by tesseract, so an additional renaming step mv eng.zip eng.traineddata would be required. Should we extend the Tesseract code to support more extensions than the current .traineddata, so that -l eng.zip would work? Or do you have another solution in mind? Maybe eng.traineddata.zip?

zdenop · 2024-02-08T07:51:11Z

Is there also an intention for tesseract to read such converted data?

If yes, then please be careful about changing the extension: it will break a lot of workflows that look for available/installed languages (AFAIR also GetAvailableLanguagesAsVector, a.k.a. tesseract --list-langs).
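
A simplified picture of why such workflows are sensitive to the extension: the sketch below assumes a language listing that filters file names by the `.traineddata` suffix. `ListLanguages` is a hypothetical function written for this example and is not the actual Tesseract implementation.

```cpp
#include <string>
#include <vector>

// Hypothetical listing: keeps only names that end with ".traineddata" and
// returns them without that suffix.
std::vector<std::string> ListLanguages(const std::vector<std::string> &filenames) {
  const std::string ext = ".traineddata";
  std::vector<std::string> langs;
  for (const std::string &name : filenames) {
    if (name.size() > ext.size() &&
        name.compare(name.size() - ext.size(), ext.size(), ext) == 0) {
      langs.push_back(name.substr(0, name.size() - ext.size()));
    }
  }
  return langs;
}
```

Under that assumption, files named `deu.zip` or `fra.traineddata.zip` would silently disappear from the list, while `eng.traineddata` would still be found regardless of its internal format.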

Nowadays it is quite common to use a private file extension instead of indicating that the file is an archive (e.g. xlsx and odt are zip archives).

On the other hand: if the file extension is not changed and tesseract is built without libarchive support, then the error handling has to be improved to explain why tesseract is not able to read the traineddata...
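
One way such error handling might look is to check for the ZIP local file header signature (`PK\x03\x04`) at the start of the file before reporting a failure. This is a sketch only: `LooksLikeZip` and `DescribeLoadFailure` are hypothetical names, and the `have_libarchive` parameter stands in for the HAVE_LIBARCHIVE build option.

```cpp
#include <string>

// Hypothetical helper: returns true when the buffer starts with the ZIP
// local file header signature "PK\x03\x04".
bool LooksLikeZip(const std::string &data) {
  static const char kZipMagic[] = {'P', 'K', '\x03', '\x04'};
  return data.size() >= sizeof(kZipMagic) &&
         data.compare(0, sizeof(kZipMagic), kZipMagic, sizeof(kZipMagic)) == 0;
}

// Builds an error message for a traineddata file that could not be loaded.
// `have_libarchive` stands in for the HAVE_LIBARCHIVE build option.
std::string DescribeLoadFailure(const std::string &filename,
                                const std::string &data,
                                bool have_libarchive) {
  if (!have_libarchive && LooksLikeZip(data)) {
    return filename +
           " is a ZIP archive, but this build has no libarchive support";
  }
  return filename + " could not be read as traineddata";
}
```

The check only covers ZIP; other archive formats would need their own signatures.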

sadra-barikbin · 2024-03-04T13:10:23Z

I thought it had already been decided that this feature would be implemented. Now it seems to be debatable. My own requirement, i.e. inspecting the config file in the .traineddata file and possibly overwriting it, is fulfilled by the sequence of combine_tessdata -e and combine_tessdata -o. I don't see a strong case for this feature either. Instead, I could work on a PageXML renderer if you agree.

Do the improvement

fc70f76
