
Feat/generate trainingsets #205

Open · wants to merge 55 commits into main

Conversation

@M3ssman (Contributor) commented Nov 18, 2020

Adds generation of training data sets from OCR formats like ALTO V3, PAGE 2013 and PAGE 2019, plus image files (TIFF, JPEG).

@kba (Collaborator) left a comment

I haven't tested it yet, but looks very promising. I particularly appreciate the unit tests 👍

generate_sets.py Outdated
"-m",
"--minchars",
required=False,
help="Minimum chars for a line")
Collaborator:

An explicit default value would be better. In sets/training_sets.py there is a constant DEFAULT_MIN_CHARS that is used in generate_sets.py, but the min_chars kwarg to TrainingSets.create is 8.

Also, why 8/16? It's very common to have valid shorter lines: the last word of a sentence on a new line, lines in narrow columns, dramas, etc.
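A minimal sketch of the suggested fix, assuming the argparse snippet quoted above; the concrete default of 1 is only the value this thread later converges on, not the PR's code:

```python
import argparse

# Hypothetical constant mirroring DEFAULT_MIN_CHARS mentioned above;
# the value 1 is an assumption taken from later discussion in this thread.
DEFAULT_MIN_CHARS = 1

parser = argparse.ArgumentParser()
parser.add_argument(
    "-m",
    "--minchars",
    type=int,
    default=DEFAULT_MIN_CHARS,
    # surface the default in --help so the user isn't surprised
    help=f"Minimum chars for a line (default: {DEFAULT_MIN_CHARS})")
args = parser.parse_args()
```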

Contributor Author:

Historical note: this originates from a newspaper digitization project, where a common text line (no ads) usually has more than 20 chars. In fact, we only took care of lines with at least 32 chars, since I thought those lines were more valuable for training than shorter ones because they provide more characters to learn from.
But personally I'm completely open on this, so what about, say, 4 chars?

Collaborator:

It makes sense to skip short lines for training, but the fact that there is a minimum number of chars, and what that minimum is, should be clearly communicated to the user, so they aren't surprised that some lines are skipped.

And yes, probably something low like 4 (documented in ./generate_sets.py --help) would be best IMHO.

@stweil (Collaborator) commented Dec 4, 2020

It probably makes sense not to skip short lines for training. Tesseract was initially trained only with artificial long lines, and the standard models have problems with short lines (typically page numbers, but also short lines ending paragraphs). We know that there exist valid lines with only a single character, such as page numbers (1 … 9, a … z, A … Z). Why should we skip lines with one or two characters as long as they are valid?

Contributor Author:

This originates from the decision to prefer longer lines because they provide more characters. I thought more characters means more training material, and more material increases pattern recognition accuracy. But this doesn't pay much attention to a character's context. In newspaper advertisements I've seen many lines that are way shorter than 8 chars, containing only abbreviations and the like. Maybe focusing on "regular article lines" is another reason why Tesseract (4.1.1) usually performs rather poorly in this realm compared to single-column layouts.

@stweil Do you suggest making the minchars arg completely optional, or setting the default value to 1, to skip lines that contain only non-printable characters?

Collaborator:

Setting minchars to 1 sounds reasonable. I cannot imagine what a line containing only non-printable characters would look like.

Collaborator:

I agree with Stefan that we should make minchars optional and try to make Tesseract learn short lines well. I'm not sure how the LSTM implementation here unrolls, but short lines should create fewer weight updates, so characters would still contribute "democratically" – there's just more incentive to get a better transition from the initial state.

"""

if self.revert:
    return reduce(lambda c, p: p + ' ' + c, self.text_words)
Collaborator:

Since we already require python-bidi, it would probably be more robust to use it for handling the inversion, cf. https://github.com/MeirKriheli/python-bidi/blob/master/bidi/algorithm.py / https://github.com/MeirKriheli/python-bidi#api
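For reference, a minimal sketch of the python-bidi API the comment refers to (not the PR's code): get_display applies the Unicode BiDi algorithm to a logical-order string.

```python
from bidi.algorithm import get_display

# Logical-order line mixing Latin and Arabic ("car" in Arabic).
line = 'car means \u0633\u064a\u0627\u0631\u0629'
print(get_display(line))  # returns the visual-order string
```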

Contributor Author:

Thanks for the hint, I'll take a look!

Contributor Author:

Hm, I guess we need to go without bidi for now, since it looks like the output from mixed Arabic + Latin lines turns them from RTL to LTR. Lines with only Arabic chars and Indic numbers seem to work pretty well with bidi, but mixed ones don't.

I'm no expert on that, though. Please take a look yourself. I've added the bidi import and adapted the line content generation (the commented section). Feel free to switch implementations. (In your preferred IDE, place a breakpoint in test_create_sets_from_page2013_and_jpg to inspect the temporary test files written.)

Contributor Author:

Do I get it right that bidi works on char level? If so, I don't think it is useful in this scenario. I only know some (rather poor) Arabic output generated by tesseract itself, which is word-based.

@kba (Collaborator) commented Nov 18, 2020

I've tested it now, unit tests pass and I managed to extract image-text pairs from the kant_aufklaerung_1784 sample in assets:

$ python3 ./generate_sets.py -d ../assets/data/kant_aufklaerung_1784/data/OCR-D-GT-PAGE/PAGE_0017_PAGE.xml -i ../assets/data/kant_aufklaerung_1784/data/OCR-D-IMG/INPUT_0017.tif 
[SUCCESS] created '20' training data sets, please review

It would be useful to make -o required or at least print the output directory as part of the SUCCESS message.

Could the -i argument be optional and by default be derived from imageFilename (PAGE) / sourceImageInformation/filename (ALTO)?

We also need a section on at least the CLI usage in the README.md
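A hedged sketch of how -i could be derived from the OCR file itself, as suggested above; the helper name is made up, and the element names follow the PAGE and ALTO specs (note ALTO spells the element fileName):

```python
import lxml.etree as etree

def derive_image_path(xml_path):
    """Read the image reference from a PAGE or ALTO file (illustrative only)."""
    root = etree.parse(xml_path).getroot()
    ns = root.tag.split('}')[0].lstrip('{')  # namespace of the root element
    if 'primaresearch.org/PAGE' in ns:
        # PAGE: Page/@imageFilename
        page = root.find(f'{{{ns}}}Page')
        return page.get('imageFilename') if page is not None else None
    # ALTO: Description/sourceImageInformation/fileName
    el = root.find(f'.//{{{ns}}}sourceImageInformation/{{{ns}}}fileName')
    return el.text if el is not None else None
```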

@M3ssman requested a review from kba November 19, 2020
@M3ssman (Contributor Author) commented Nov 19, 2020

For the Arabic text that is included as test resource (288652), and that's causing trouble with bidi, please see the original image (binarized):

[image: 288652]

@kba mentioned this pull request Nov 26, 2020
@Shreeshrii (Collaborator) commented
@kba Do you know of any Devanagari or other Indic-language datasets in PAGE XML format? I only have scanned page images and their ground truth in text format. I don't think those will work with this PR.

@kba (Collaborator) commented Dec 7, 2020

> @kba Do you know of any Devanagari or other Indic-language datasets in PAGE XML format? I only have scanned page images and their ground truth in text format. I don't think those will work with this PR.

Sorry, I do not. But maybe you have OCR results in Devanagari to test the mechanics of this PR? What problems do you foresee with Devanagari?

@Shreeshrii (Collaborator) commented Dec 8, 2020

> What problems do you foresee with Devanagari?

I don't foresee any, but wanted to test with complex scripts, just in case there is any difference in processing.

> maybe you have OCR results in Devanagari to test the mechanics of this PR?

Good idea. I can test using ALTO output from tesseract.

> Devanagari or any other Indic language datasets in Page XML format

I found a set of files at https://github.com/ramayanaocr/ocr-comparison/tree/master/Transkribus/Input, which has the png files as well as the xml files (generated by Transkribus, I guess). I tested with one of those files; while the console messages reported success, the files were not created. The summary option created a file, but the file had empty lines.

 tesstrain-extract-gt  /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png
[INFO   ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: False, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review

I tested with the Arabic image shared earlier in this thread with its xml file in resources, just to make sure that I had the PR installed correctly. That worked, i.e. it created the files. I haven't looked at the text within them.

tesstrain-extract-gt /home/ubuntu/tesstrain/tests/resources/xml/288652.xml -i /home/ubuntu/pagedeva/288652.png -o /home/ubuntu/pagedeva/output -s
[INFO   ] generate trainingsets of '/home/ubuntu/tesstrain/tests/resources/xml/288652.xml' with '/home/ubuntu/pagedeva/288652.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '33' training data sets in '/home/ubuntu/pagedeva/output', please review

Is there a compatibility issue with Transkribus-generated PAGE files?

@Shreeshrii (Collaborator) commented Dec 12, 2020

I tested just now with ALTO output from tesseract and got the following warnings:

 tesstrain-extract-gt /home/ubuntu/tesstrain-San/test/iast/sandocs_2.xml -i /home/ubuntu/tesstrain-San/test/iast/sandocs_2.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/tesstrain-San/test/iast/sandocs_2.xml' with '/home/ubuntu/tesstrain-San/test/iast/sandocs_2.png' (min: 1, sum: True, reorder: False)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:234: RuntimeWarning: Degrees of freedom <= 0 for slice
  keepdims=keepdims)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:195: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/home/ubuntu/miniforge3/lib/python3.7/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
[SUCCESS] created '5' training data sets in 'training_data_sandocs_2', please review

EDIT: The earlier error with ALTO was because of a typo in the filename.

@M3ssman (Contributor Author) commented Dec 13, 2020

@Shreeshrii Thanks for pointing to PAGE files that lack `Word` elements entirely! That was the cause of the missing results for the provided Devanagari sample. I tried to fix this and integrated the file as a new test resource. Unfortunately, I can't say a word about the textual outcome, so please update the PR and have a look again ...

@Shreeshrii (Collaborator) commented Dec 14, 2020

@M3ssman I tried just now but am getting the same result as before.

 git log -3
commit 3fb94996ac42818b302850080a6f2535db12251e (HEAD -> pagesets)
Author: M3ssman <uwe.hartwig@bitsrc.info>
Date:   Sun Dec 13 10:44:47 2020 +0100

    [app][fix] handle page without word elements

commit 2f3566bc23a848e3df7801b2fa1a6ce1d417e7bc
Author: M3ssman <uwe.hartwig@bitsrc.info>
Date:   Mon Dec 7 14:19:58 2020 +0100

    [app][fix] filter invalid lines

commit 57ba229ace0c9ae74afb889916cba3555ef7b4d0
Author: M3ssman <uwe.hartwig@bitsrc.info>
Date:   Mon Dec 7 13:18:48 2020 +0100

    [app][test] fix test imports
 tesstrain-extract-gt  /home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml -i /home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/ocr-comparison/Transkribus/Input/page/ram110.xml' with '/home/ubuntu/ocr-comparison/Transkribus/Input/ram110.png' (min: 1, sum: True, reorder: False)
[SUCCESS] created '24' training data sets in 'training_data_ram110', please review

However, only the summary file is created in 'training_data_ram110'. The file is attached.

ram110_summary.gt.txt

PS: I looked at the XML file and the Devanagari text in it has errors, so it is probably raw OCR output and not corrected ground truth.

@Shreeshrii (Collaborator) commented Dec 14, 2020

I also tried with the ALTO 4.1 XML referenced in the issue I opened at OCR-D/ocrd_fileformat#23. That fails with the following messages:

(base) ubuntu@tesseract-ocr-1:~/tesstrain-pagesets$ tesstrain-extract-gt /home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.xml -i /home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.png -s
[INFO   ] generate trainingsets of '/home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.xml' with '/home/ubuntu/OCR_GS_Data/TypeFaces/persian_watts_typeface/data/ahsan_at_tavarikh_31.png' (min: 1, sum: True, reorder: False)
Traceback (most recent call last):
  File "/home/ubuntu/miniforge3/bin/tesstrain-extract-gt", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/cli.py", line 74, in main
    reorder=REORDER)
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 351, in create
    self.xml_data, min_len=min_chars, reorder=reorder)
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 184, in text_line_factory
    ns_prefix = _determine_namespace(xml_data)
  File "/home/ubuntu/miniforge3/lib/python3.7/site-packages/generate_sets/training_sets.py", line 223, in _determine_namespace
    return [k for (k, v) in XML_NS.items() if v == root_tag][0]
IndexError: list index out of range

@M3ssman (Contributor Author) commented Dec 14, 2020

@Shreeshrii Thanks for pointing towards ALTO V4. I had missed this before, since we're using the latest official stable release, Tesseract 4.1, which doesn't create this kind of ALTO data. I've added the ALTO V4 namespace declaration and it worked fine. Somehow I found this surprising, since the ALTO V4 data from OpenITI you pointed out looks quite unfamiliar, having String CONTENT spanning a complete text line. I've never seen this before. Where does this data come from?
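For illustration, a hedged sketch of the kind of namespace lookup seen in the traceback above, extended with the ALTO V4 entry and a clearer error than the raw IndexError; the constant and function names mirror the traceback, the URIs are the official namespace URIs, and the rest is an assumption:

```python
# Known root namespaces mapped to format keys (sketch, not the PR's table).
XML_NS = {
    'alto_v3': 'http://www.loc.gov/standards/alto/ns-v3#',
    'alto_v4': 'http://www.loc.gov/standards/alto/ns-v4#',
    'page_2013': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15',
}

def _determine_namespace(root_tag):
    """Map a document's root namespace to a known format key."""
    matches = [k for (k, v) in XML_NS.items() if v == root_tag]
    if not matches:
        # fail loudly instead of IndexError: list index out of range
        raise ValueError(f"unknown OCR format with root namespace: {root_tag}")
    return matches[0]
```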

Regarding the Devanagari issue: your git log looks fine, the version matches. Maybe tesstrain-extract-gt in your current, active environment is outdated, so please drop it and do a fresh install afterwards. You can also run pytest -v to execute the test cases included so far (with their test datasets) and check the temporary outputs in your local /tmp/pytest-of-<account> dir.


@Shreeshrii (Collaborator) commented

> ALTO V4 data from OpenITI you pointed out looks quite unfamiliar, having String CONTENT spanning a complete text line. I've never seen this before. Where does this data come from?

I do not know more than the info available online. Please see
https://github.com/OpenITI/RELEASE
and
https://zenodo.org/record/4075046#.X9hC0dgzaUk

@M3ssman (Contributor Author) commented Dec 15, 2020

@Shreeshrii Please note that test images are created on the fly with a library that out of the box can render only a very small subset of UTF-8 chars, I guess only ASCII: neither Arabic, Persian, Devanagari nor old German Fraktur letters. This was introduced to keep the test data small and free from binary image blobs. It only gives you a hint whether the lines would match the "words".

@M3ssman (Contributor Author) commented Dec 15, 2020

@Shreeshrii Regarding the latest version: currently there's only a pre-beta version (0.0.1) annotated in setup.py. Usually this would be the place to follow versioning. I don't know how to pull in some sort of repository information at this point. Maybe @kba can give us a hint?

@Shreeshrii (Collaborator) commented Dec 15, 2020

@M3ssman Thanks for the explanations regarding test files.

> Maybe tesstrain-extract-gt in your current, active environment is outdated, so please drop it and do a fresh install afterwards.

You were right about this.

I removed tesstrain-extract-gt from the bin directories and reinstalled it in the environment where ocrd is installed. It works now. All the tif and gt.txt files were created for the Transkribus Devanagari file.

The ALTO 4.1 Persian file also generates line images and text. (I haven't checked the RTL issue yet.)

This is great!! Thank you.

@M3ssman (Contributor Author) commented Dec 15, 2020

@Shreeshrii You're welcome!

... Sorry for the confusion regarding RTL ... it finally turned out that the -r flag aims at something different than real RTL handling, which can be done with py-bidi. If active, it only re-arranges word tokens by their top-left corner in descending order, starting from the right margin. Therefore I renamed it to --reorder. It doesn't reverse characters. I had to deal with Arabic PAGE-XML exported from Transkribus, with inconsistent reading orders and display artifacts, which almost made me go crazy.

Since this relies on individual coordinates for each token, I'm afraid it will have no effect on test resources like the ones gathered from OpenITI, which only have a single String@CONTENT element representing a text line in total (or at least more than just one word). Reordering this way requires proper coordinates below text line level: we can't just chop up the lines and reorder tokens, since the source order of elements of a plain text line is certainly not always reliable.
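A minimal sketch of the reordering described above (not the PR's implementation): word tokens are sorted by the x coordinate of their top-left corner, descending, i.e. starting from the right margin.

```python
def reorder_words(words):
    """words: list of (top_left_x, top_left_y, token) tuples."""
    return [t for (x, _y, t) in sorted(words, key=lambda w: w[0], reverse=True)]

# Tokens laid out right-to-left on the line come out in reading order:
print(' '.join(reorder_words([(10, 5, 'third'), (120, 5, 'first'), (60, 5, 'second')])))
# first second third
```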

@stale (bot) commented Jan 15, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the "stale" label on Jan 15, 2021.
@Shreeshrii (Collaborator) commented

This should not be closed. It needs review by someone familiar with RTL languages.

The stale bot removed the "stale" label on Jan 18, 2021.
@lgtm-com (bot) commented Jan 18, 2021

This pull request introduces 4 alerts when merging f3e73e4 into fa57d61 - view on LGTM.com

new alerts:

  • 3 for __init__ method calls overridden method
  • 1 for 'import *' may pollute namespace

@M3ssman (Contributor Author) commented Jan 19, 2021

I've been talking with https://github.com/galdring, a colleague, about this review, and he's going to find us somebody.

@zdenop (Contributor) commented Jan 9, 2023

@M3ssman: can you please update your PR to the current git code? (The Python code is now in src; see "Migrate Python code to a dedicated package".)

@M3ssman (Contributor Author) commented Jan 26, 2023

@zdenop Sorry for the late reply.

What layout do you prefer: <project_root>/src/extract_sets, or integrating training_sets.py into <project_root>/src as part of <project_root>/src/tesstrain?

@stefan6419846 (Contributor) commented
If I understood @zdenop correctly, the final goal is to make everything available through the tesstrain Python package. As you provide a dedicated entry point, src/tesstrain sounds like the appropriate package.

Nevertheless, I am not sure about the external dependencies. They should probably be made optional (extras_require).

@M3ssman (Contributor Author) commented Jan 26, 2023

@stefan6419846 Thanks for your reply! Do you suggest pushing these dependencies into setuptools' extras_require?

@stefan6419846 (Contributor) commented
@M3ssman If you are going to integrate the training set generator into the existing Python package, I would suggest yes. At least to me they appear to be overkill for most users who just want to use the basic artificial training functionality.
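A hedged sketch of the extras_require mechanism under discussion; the dependency split is a placeholder guessed from imports seen in this thread, not the PR's actual setup.py:

```python
from setuptools import setup

setup(
    name='tesstrain',
    version='0.0.1',
    install_requires=['lxml'],  # core dependencies stay mandatory
    extras_require={
        # installed only via: pip install tesstrain[extract]
        'extract': ['numpy', 'opencv-python-headless', 'exifread'],
    },
)
```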


`tesstrain-extract-sets` currently supports OCR data in ALTO V3, PAGE 2013 and PAGE 2019, as well as TIFF, JPEG and PNG images.

Output is written as UTF-8 encoded plain text files and TIFF images. The image frame is produced from the textline coordinates in the OCR data, so please take care of properly annotated geometrical information. Additionally, the tool can add a fixed synthetic padding around the textline or store it binarized (`--binarize`).
Collaborator:

Isn't padding for raw images going to be a disaster? I'd recommend disallowing this combination in the CLI right away.



By default, several sanitize actions are performed at image line level, like deskewing or removement of top-bottom intruders. To disable this, add flag `--no-sanitze`.
Collaborator:

Suggested change
By default, several sanitize actions are performed at image line level, like deskewing or removement of top-bottom intruders. To disable this, add flag `--no-sanitze`.
By default, several optimization actions are performed at image line level, like deskewing or removal of top-bottom intruders. To disable this, add flag `--no-sanitize`.


import exifread
import lxml.etree as etree
import numpy as np
Collaborator:

I don't understand. Why would you want to remove this import, which is clearly required, @stweil? And why do you say it's WIP, @M3ssman?

* drawing artificial border
* collect only contours that touch this
* get contours that are specific ratio to close to the edge
* fill those with specific grey tone
Collaborator:

I don't think this operation will be helpful for raw images. For binarized ones it may be an improvement, but a grey untextured fill is certainly going to irritate the pixel pipeline (as it introduces artificial edges etc.). It's also not realistic (it won't be seen at inference), so forcing the models to learn it is not a good idea.

Collaborator:

Also, didn't you write a textured fill (grey_canvas IIRC) for that very purpose (but for synthetic training) already?

only if so, enhance img to prevent rotation
black area artifacts with constant padding
* rotate
* slice rotation result due previous padding
Collaborator:

Doing all this on the line-level image calls for trouble:

  • skew detection via Hough transform is much less reliable than on the region level
  • derotation introduces white corners, which you then have to fill in – again, detrimental to raw/rgb images
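For context, a minimal sketch of the pad-rotate-slice sequence the quoted docstring describes, assuming OpenCV; the function and parameter names are made up, not the PR's code:

```python
import cv2

def derotate_line(line_img, angle_deg, pad=16):
    """Pad, rotate, then slice off the padding again (illustrative only)."""
    # constant padding prevents black corner artifacts from the rotation
    padded = cv2.copyMakeBorder(line_img, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=255)
    h, w = padded.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    rotated = cv2.warpAffine(padded, matrix, (w, h), borderValue=255)
    # slice the rotation result due to the previous padding
    return rotated[pad:h - pad, pad:w - pad]
```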

@bertsky (Collaborator) commented Jan 26, 2023

> Nevertheless, I am not sure about the external dependencies. They should probably be made optional (extras_require).

> At least to me they appear to be overkill for most users who just want to use the basic artificial training functionality.

I disagree with that assessment. The package for synthetic training is as relevant as some way to import from the widely used file formats (ALTO, PAGE) for real GT training, IMO. So if the trainingsets extension is adopted (at all), then its dependencies should not be moved to extras_require.

fhdl.writelines(contents)


def calculate_grayscale(low=168, neighbourhood=32, in_data=None):

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
return tuple(map(lambda c: sum(c) / len(c), zip(*point_pairs)))


def to_center_coords(elem, namespace, vertical=False):

Check notice (Code scanning / CodeQL): Explicit returns mixed with implicit (fall through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
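An illustration of the pattern both notices flag, using a hypothetical function rather than the PR's code: the falsy branch should return None explicitly instead of falling through.

```python
def to_center(point_pairs):
    if point_pairs:
        # average each coordinate column to get the center point
        return tuple(sum(c) / len(c) for c in zip(*point_pairs))
    return None  # explicit, instead of an implicit fall-through
```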
self.set_id()
self.set_text()
if self.valid:
    self.reorder = reorder

Check warning (Code scanning / CodeQL): Overwriting attribute in super-class or sub-class. Assignment overwrites attribute reorder, which was previously defined in superclass TextLine.
@@ -5,6 +5,8 @@

ROOT_DIRECTORY = Path(__file__).parent.resolve()

installation_requirements = open('requirements.txt', encoding='utf-8').read().split('\n')

Check warning (Code scanning / CodeQL): File is not always closed. File is opened but is not closed.
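A hedged sketch of the fix CodeQL asks for here: read requirements.txt via a context manager so the handle is always closed (same file as in the snippet above).

```python
with open('requirements.txt', encoding='utf-8') as req_file:
    installation_requirements = req_file.read().splitlines()
```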
do_opt = args.sanitize
intrusion_ratio = args.intrusion_ratio
if isinstance(intrusion_ratio, str) and ',' in intrusion_ratio:
    intrusion_ratio = [float(n) for n in intrusion_ratio.split(',')]

Check warning (Code scanning / CodeQL): Variable defined multiple times. This assignment to 'intrusion_ratio' is unnecessary as it is redefined before this value is used.
if isinstance(intrusion_ratio, str) and ',' in intrusion_ratio:
    intrusion_ratio = [float(n) for n in intrusion_ratio.split(',')]
else:
    intrusion_ratio = float(intrusion_ratio)

Check warning (Code scanning / CodeQL): Variable defined multiple times. This assignment to 'intrusion_ratio' is unnecessary as it is redefined before this value is used.
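A hedged sketch of removing the redundant first assignment both alerts point at: parse the CLI value exactly once (the name is taken from the snippet above; not the PR's final code).

```python
def parse_intrusion_ratio(raw):
    """Parse '0.1' to a float, or '0.1,0.2' to a list of floats."""
    if isinstance(raw, str) and ',' in raw:
        return [float(n) for n in raw.split(',')]
    return float(raw)

print(parse_intrusion_ratio('0.1,0.2'))  # [0.1, 0.2]
```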