
[improvement] .render() isn't that robust - wrong ordered results #1586

Open
kripper opened this issue May 6, 2024 · 12 comments
Labels
help wanted (Extra attention is needed) · module: models (Related to doctr.models) · type: bug (Something isn't working) · type: enhancement (Improvement)

Comments

@kripper

kripper commented May 6, 2024

Bug description

The default OCR model works very well, but the render() algorithm, which converts coordinates to text positions, is very buggy.
This causes lines that originally sit at the top of the page to end up between other lines at the bottom, making the overall result unusable for LLM inference.

I wonder if you have considered reusing the algorithm implemented in Tesseract. They probably solved the same problem many years ago.
And I also wonder why the Tesseract team is not integrating the doctr engine into Tesseract :-)

Good job! You are leading the OCR leaderboard.

I attached a sample .PDF file and a snippet to reproduce the problem.
I checked other similar inactive issues, so I'm afraid rendering to text is currently not a hot topic :-(
...but how are we supposed to feed our hungry LLMs?

Code snippet to reproduce the bug

import argparse
import os
import json

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

def convert_pdf_to_txt(input_pdf, output_txt):
  """
  Converts a PDF file to a text file using DocTR OCR.

  Args:
      input_pdf (str): Path to the input PDF file.
      output_txt (str): Path to the output text file.
  """

  print("Load pre-trained OCR model")
  model = ocr_predictor(pretrained=True)

  # Ensure input PDF exists
  if not os.path.exists(input_pdf):
    raise ValueError(f"Input PDF file '{input_pdf}' does not exist.")

  # Load the PDF document
  try:
    doc = DocumentFile.from_pdf(input_pdf)
  except Exception as e:
    raise ValueError(f"Error loading PDF '{input_pdf}': {e}")

  # Perform OCR and extract text
  try:
    result = model(doc)
    #exp = result.export()
    #text = json.dumps(exp)
    text = result.render()
  except Exception as e:
    raise ValueError(f"Error performing OCR on '{input_pdf}': {e}")

  # Write extracted text to output file
  with open(output_txt, 'w', encoding='utf-8') as f:
    f.write(text)

  print(f"PDF '{input_pdf}' converted to text file '{output_txt}'.")

if __name__ == "__main__":
  parser = argparse.ArgumentParser(description="Convert PDF to text using DocTR OCR")
  parser.add_argument("input_pdf", help="Path to the input PDF file")
  parser.add_argument("output_txt", help="Path to the output text file")
  args = parser.parse_args()

  convert_pdf_to_txt(args.input_pdf, args.output_txt)

Error traceback

No error

Environment

Linux, conda, python 3.9

Deep Learning backend

Default model.
test-ocr.pdf

@kripper kripper added the type: bug label May 6, 2024
@felixdittrich92
Contributor

Hi @kripper 👋,

Thanks for reporting :)

The issue here is that pages 2 & 3 contain small rotations. Could you give it a try with passing assume_straight_pages=False to the ocr_predictor instance? :)

@kripper
Author

kripper commented May 8, 2024

Predictor initialized with:

model = ocr_predictor(pretrained=True, assume_straight_pages=False)

But the problem persists on page 1:

Notario y Conservador de Bienes Raices Licanten Vilma Beatriz Navarro
<--- "Reyes" SHOULD GO HERE
Certifico que el presente documento electronico es copia fiel e integra de
CERTIFICADO otorgado el 26 de Abril de 2024 reproducido en las siguientes

Reyes <-------- BUT WAS PLACED HERE

paginas.

Also note that the OCR'ed page (page 1) is a clean PDF page.
The second page is an image, and assume_straight_pages=False could help there.

@Cata400

Cata400 commented May 8, 2024

From what I have also seen, sometimes the models predict lines in the wrong block, even though their coordinates are correct. This is why the render() method returns the text mixed up: it is only a bunch of nested for loops going over all the pages, blocks, lines and words. To work around it I did the following; it somewhat messes up the line breaks, but it preserves the order:

def sort_by_coordinates(element):
    # Sort key: top-left corner of the element's relative bounding box,
    # first by y (top to bottom), then by x (left to right).
    return (element.geometry[0][1], element.geometry[0][0])

result = model(doc)
text = ""

for page in result.pages:
    # Collect every line on the page, ignoring the (possibly wrong) block assignment
    line_list = []
    for block in page.blocks:
        line_list.extend(block.lines)

    sorted_lines = sorted(line_list, key=sort_by_coordinates)

    for line in sorted_lines:
        for word in line.words:
            text += word.value + " "  # Word objects expose their text via .value
        text += "\n"

    text += "\n"

@felixdittrich92
Contributor

@kripper Have you already tried to disable block and/or line resolving?
https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches

resolve_blocks=False
resolve_lines=False
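
For reference, a minimal sketch of where those flags go, assuming they are accepted as keyword arguments by ocr_predictor as the linked docs describe (combined here with the earlier assume_straight_pages suggestion):

from doctr.models import ocr_predictor

# Sketch: the resolve_* flags control how detected words are grouped into lines and blocks
# before rendering; disabling them skips that grouping step.
model = ocr_predictor(
    pretrained=True,
    assume_straight_pages=False,  # from the earlier suggestion, for the rotated pages
    resolve_blocks=False,
    resolve_lines=False,
)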

@kripper
Author

kripper commented May 8, 2024

@kripper Have you already tried to disable block and/or line resolving? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches

resolve_blocks=False resolve_lines=False

It's now mixing blocks multiple times per line.

What about taking a look at Tesseract's implementation?

@felixdittrich92
Contributor

@kripper Have you already tried to disable block and/or line resolving? https://mindee.github.io/doctr/using_doctr/using_models.html#two-stage-approaches
resolve_blocks=False resolve_lines=False

It's now mixing blocks multiple times per line.

What about taking a look at Tesseract's implementation?

Sure :)
Do you have a direct reference to the code or algorithm?

@kripper
Author

kripper commented May 8, 2024

Do you have a direct reference to the code or algorithm?

No, but I will research tomorrow.

@kripper
Author

kripper commented May 8, 2024

Have you tried existing tools to convert doctr's hOCR output to text? There are many. Tesseract is probably also using some of them.

@felixdittrich92
Contributor

Have you tried existing tools to convert doctr's hOCR output to text? There are many. Tesseract is probably also using some of them.

Yeah, you can use doctr's XML/hOCR output to create PDF/A files, for example with OCRmyPDF.
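
If it helps, a minimal sketch of that route, assuming the export_as_xml() method on the result document described in the doctr docs (it should return one (xml bytes, ElementTree) pair per page):

# model and doc as in the snippet from the original report
result = model(doc)
for i, (xml_bytes, _tree) in enumerate(result.export_as_xml()):
    # Each page's hOCR output can then be post-processed by external
    # hOCR-to-text tooling or by OCRmyPDF instead of result.render().
    with open(f"page_{i}.hocr", "wb") as f:
        f.write(xml_bytes)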

@kripper
Author

kripper commented May 8, 2024

sometimes the models predict lines in the wrong block

The synthesized page looks fine. Identifying lines shouldn't be that difficult IMO.

(attached image: out)

@felixdittrich92
Contributor

sometimes the models predict lines in the wrong block

The synthesized page looks fine. Identifying lines shouldn't be that difficult IMO.

(attached image: out)

Depends on the document's layout ^^ And there is a lot of variation (rotated pages, block text, etc.)

@felixdittrich92 felixdittrich92 changed the title from "Wrong layout generated by render() to text" to "[improvement] .render() isn't that robust - wrong ordered results" May 22, 2024
@felixdittrich92
Contributor

Especially with rotations, all other open source tools (paddleOCR / tesseract / easyOCR) also fail.

Possible way to go: https://arxiv.org/abs/2305.02577 --> investigate

@felixdittrich92 felixdittrich92 self-assigned this May 22, 2024
@felixdittrich92 felixdittrich92 added the type: enhancement, help wanted, and module: models labels May 22, 2024
@felixdittrich92 felixdittrich92 added this to the 2.0.0 milestone May 22, 2024