Save the OCRed PDF #1595

Closed
micos7 opened this issue May 13, 2024 · 2 comments
Labels
awaiting response (Waiting for feedback), type: enhancement (Improvement)


micos7 commented May 13, 2024

🚀 The feature

I'd like to save the PDF after OCR.

Motivation, pitch

Alternatives

I tried something like this, but it throws exceptions all over:

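`from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load the PDF pages (pdf_content is supplied by the surrounding code)
doc = DocumentFile.from_pdf(pdf_content)

# Perform OCR using doctr
model = ocr_predictor(pretrained=True)
result = model(doc)

# Extract text from the OCR result, word by word
text = ""
for page in result.pages:
    for block in page.blocks:
        for line in block.lines:
            for word in line.words:
                text += word.value + " "

# Save the OCR result back to the original file path.
# Note: this writes plain text into the .pdf path, which does not
# produce a valid PDF; this is the step that needs a different approach.
with open(body.url, 'w') as pdf_file:
    pdf_file.write(text)`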

Additional context

Thanks for your work.

No response

micos7 added the type: enhancement label May 13, 2024

felixdittrich92 (Contributor) commented

Hi @micos7 👋
What you want to create is a searchable PDF (a PDF with a text layer over the page image), which is often saved as PDF/A.
Please take a look at https://mindee.com/blog/create-ocrized-pdfs-in-2-steps :)
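
For readers who want a rough idea of what adding a text layer involves, here is a minimal sketch. It is not necessarily the method described in the linked post: it swaps in PyMuPDF (`fitz`) and writes each recognized word as invisible text at its detected position. It assumes the default straight-page output of docTR, where `result.export()` gives each word a relative `((xmin, ymin), (xmax, ymax))` geometry, and it ignores page rotation and crop boxes. The helper names `iter_words` and `add_text_layer` are hypothetical.

```python
def iter_words(export, page_idx, width, height):
    """Yield (text, (x0, y0, x1, y1)) in page units for one exported page.

    `export` is assumed to be the dict returned by docTR's result.export(),
    with word geometry given as relative coordinates in [0, 1].
    """
    page = export["pages"][page_idx]
    for block in page["blocks"]:
        for line in block["lines"]:
            for word in line["words"]:
                (xmin, ymin), (xmax, ymax) = word["geometry"]
                box = (xmin * width, ymin * height, xmax * width, ymax * height)
                yield word["value"], box


def add_text_layer(pdf_in, pdf_out, export):
    """Write OCR words as invisible text over the pages of an existing PDF."""
    import fitz  # PyMuPDF, imported lazily so iter_words works without it

    with fitz.open(pdf_in) as pdf:
        for idx, page in enumerate(pdf):
            w, h = page.rect.width, page.rect.height
            for text, (x0, y0, x1, y1) in iter_words(export, idx, w, h):
                # Approximation: the baseline sits at the bottom of the box
                # and the font size equals the box height.
                page.insert_text(
                    (x0, y1),
                    text,
                    fontsize=max(y1 - y0, 1.0),
                    render_mode=3,  # 3 = invisible text, still searchable
                )
        pdf.save(pdf_out)
```

With the `result` object from the question, a call would look like `add_text_layer("scan.pdf", "scan_ocr.pdf", result.export())`, where both file names are placeholders. Writing to a new output path rather than overwriting the input keeps the original scan intact if the placement turns out to be off.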

felixdittrich92 added the awaiting response label May 17, 2024

felixdittrich92 (Contributor) commented

Any updates @micos7 ? :)

micos7 closed this as completed May 17, 2024