OCR with PDF Embedded Text

Document AI - Document OCR

From Release Notes

The Document AI OCR Processor has the following new features:

  • The OCR Processor now supports extracting embedded text from digital PDFs (public preview). When the PDF being processed also contains non-digital text, the processor automatically falls back to the optical OCR model to extract text in those regions. To opt into this feature, set process_options.ocr_config.enable_native_pdf_parsing=true in your API request to the OCR Processor.
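As a rough illustration of the opt-in flag, the sketch below builds a JSON request body for the REST `process` method using only the standard library. It assumes the REST API's camelCase spelling of the field (`processOptions.ocrConfig.enableNativePdfParsing`); the helper name `build_process_request` and the placeholder PDF bytes are hypothetical and not part of this sample's main.py.

```python
import base64
import json


def build_process_request(pdf_bytes: bytes, enable_native_pdf_parsing: bool = True) -> dict:
    """Build a JSON-serializable body for a Document AI process request.

    Hypothetical helper: field names follow the REST API's camelCase
    convention, which is an assumption of this sketch.
    """
    return {
        "rawDocument": {
            # Inline documents are sent as base64-encoded bytes.
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        },
        "processOptions": {
            # Opt in to embedded (native) PDF text extraction.
            "ocrConfig": {"enableNativePdfParsing": enable_native_pdf_parsing},
        },
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real PDF file's contents.
    body = build_process_request(b"%PDF-1.7 placeholder")
    print(json.dumps(body["processOptions"], indent=2))
```

Passing `enable_native_pdf_parsing=False` produces the same body with the flag turned off, which makes it easy to send both variants of a request for the same document.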

Known issues with the digital PDF feature of the Document AI OCR Processor:

  • On a small number of documents, the word ordering within lines of text as reported by native text extraction might be wrong.
  • On certain documents, invisible text embedded in a native PDF may be reported.
  • On certain Japanese documents, currency symbols such as Yen might be incorrectly extracted as /.
  • On certain documents, apostrophe symbols may be missing in word/line results.
  • On certain documents, native text extraction might report different word/line results than those obtained by image-based OCR on an identical document.

Sample Document

  • A sample document has been provided that demonstrates how the results can differ when embedded text is used instead of OCR-detected text.
  • Declaration of Independence (Cursive)
    • This document is the text of The Declaration of Independence in a cursive script created in Google Docs.
    • Try this document with the sample code in main.py with enable_native_pdf_parsing set to True or False and compare the results.
    • Example Diff (enable_native_pdf_parsing set to True and False, respectively)
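One way to produce such a diff yourself is sketched below, using Python's standard `difflib` module. It assumes you have already saved the extracted document text from both runs as plain strings; the function name `diff_ocr_text` and the two sample strings are made up for illustration and are not actual processor output.

```python
import difflib


def diff_ocr_text(native_text: str, ocr_text: str) -> list[str]:
    """Return unified-diff lines between the two extraction results."""
    return list(
        difflib.unified_diff(
            native_text.splitlines(),
            ocr_text.splitlines(),
            fromfile="enable_native_pdf_parsing=True",
            tofile="enable_native_pdf_parsing=False",
            lineterm="",  # input lines carry no trailing newlines
        )
    )


if __name__ == "__main__":
    # Made-up strings standing in for the text returned by each run.
    native = "When in the Course of human events\nit becomes necessary"
    optical = "When in the Couse of human events\nit becomes necessary"
    for line in diff_ocr_text(native, optical):
        print(line)
```

An empty result means both runs returned identical text; otherwise, lines prefixed with `-` come from the embedded-text run and lines prefixed with `+` come from the image-based OCR run.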