Add configs for DPI, Page Segmentation Mode, and Zotero non-linked attachments #41

Open, wants to merge 2 commits into base: master

@danpf danpf commented Jun 29, 2022

Thought I'd take a stab at this.

Added outputDPI and outputAsCopyAttachment as configuration options.

It seems to work, but I can't get it to work with group libraries. Do you have any idea why that might be?
Briefly:

It works when I have a PDF selected in a sub-collection of my personal 'My Library', but when I use it on an item selected in a sub-collection of a group library I get errors like the one below. The same errors occur with the zotero-ocr plugin itself, so maybe the problem is that I'm basing my logic on that plugin.

[JavaScript Error: "Parent item 1/4Q5DY97J not found" {file: "chrome://zotero/content/xpcom/data/item.js" line: 1537}]

My guess is that parent items are somehow mangled in the database for group libraries, but I'm not sure how to check or confirm that. The code looks correct to me, and this line seems to return the right values:
https://github.com/danpf/zotero-ocr/blob/9eb9a8ec9a5ada40be27d07ca6de847637c14d2b/chrome/content/zoteroocr.js#L105

I posted about the issue on zotero-dev but didn't get a response:
https://groups.google.com/g/zotero-dev/c/LVmcjIMqYvA

@stweil changed the title from "[WIP] Add configs for DPI + copyattacments" to "[WIP] Add configs for DPI + copyattachments" on Mar 6, 2023

danpf commented Mar 11, 2023

Not sure if you are still interested in this, @stweil, but I got a response from the Zotero devs and was able to fix this PR for group libraries with 'hard' (non-linked) attachments. Their API is currently incompatible with linked attachments in group libraries. I think implementing that would only make sense in the context of network drives, so they probably won't address it.
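
The distinction between non-linked and linked attachments could be reduced to a small decision function. This is a minimal sketch, not the PR's actual code: `chooseAttachmentMode` and `isGroupLibrary` are hypothetical names, and falling back to a stored copy in group libraries is an assumption made for illustration. Only `outputAsCopyAttachment` comes from this PR.

```javascript
// Decide how the OCR output PDF should be attached to its parent item.
// Returns "stored" (a copy kept in Zotero storage) or "linked" (a link to
// a file on disk). To my understanding, a stored copy is typically created
// with Zotero.Attachments.importFromFile and a linked file with
// Zotero.Attachments.linkFromFile; neither is called here, so the sketch
// runs outside Zotero.
function chooseAttachmentMode({ outputAsCopyAttachment, isGroupLibrary }) {
  // ASSUMPTION: linked files are unavailable in group libraries, so a
  // stored copy is used there regardless of the preference.
  if (isGroupLibrary) {
    return "stored";
  }
  return outputAsCopyAttachment ? "stored" : "linked";
}
```

With this shape, the preference only matters in a personal library; in a group library the function always picks the stored copy.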

Docs:
This PR adds 3 new options to ZoteroOCR

  • The ability to modify the output DPI
    • The default is set to 300
  • The ability to modify the Tesseract Page Segmentation Mode (PSM)
  • The ability to add the new PDFs as attachments rather than 'linked files'
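
As a rough illustration of how the three options might be read with fallbacks, here is a minimal sketch. The names `outputDPI` and `outputAsCopyAttachment` appear earlier in this PR; `pageSegmentationMode`, `readOcrOptions`, and `getPref` are hypothetical. Only the DPI default of 300 is stated above; the other two defaults are assumptions.

```javascript
// Defaults for the three options. Only the DPI default is stated in the PR.
const DEFAULTS = Object.freeze({
  outputDPI: 300,
  pageSegmentationMode: 3,       // hypothetical pref name; 3 is Tesseract's default PSM
  outputAsCopyAttachment: false, // ASSUMPTION: keep producing linked files by default
});

// getPref is any function mapping a pref name to a stored value, or to
// undefined when unset. In the plugin it would presumably wrap Zotero's
// preference store; it is injected here so the sketch runs on its own.
function readOcrOptions(getPref) {
  const options = {};
  for (const [name, fallback] of Object.entries(DEFAULTS)) {
    const value = getPref(name);
    // Fall back when the pref is unset or has the wrong type.
    options[name] = typeof value === typeof fallback ? value : fallback;
  }
  // Replace a non-positive or fractional DPI with the default.
  if (!Number.isInteger(options.outputDPI) || options.outputDPI <= 0) {
    options.outputDPI = DEFAULTS.outputDPI;
  }
  return options;
}
```

Calling `readOcrOptions(() => undefined)` yields the defaults, which keeps the plugin's behavior unchanged for users who never open the settings panel.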

I have confirmed that this PR works on an M1 MacBook. Here is a new screenshot of the settings panel:
[image]

If you would be interested in merging, please confirm that it works on your device as well. I don't normally touch JS.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
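
Since the valid modes form the closed range 0 to 13, a settings value can be checked before it reaches the command line. The following is a minimal sketch under that assumption; `psmArgs` is a hypothetical helper, not a function from this PR. It relies only on Tesseract's `--psm` command-line option.

```javascript
const PSM_MIN = 0;
const PSM_MAX = 13;
const PSM_DEFAULT = 3; // "Fully automatic page segmentation, but no OSD."

// Turn a page segmentation mode into Tesseract command-line arguments.
// Throws a RangeError for anything outside the list above.
function psmArgs(psm = PSM_DEFAULT) {
  // Accept numbers or digit-only strings (a settings field may store text).
  const mode = typeof psm === "string" && /^\d+$/.test(psm) ? Number(psm) : psm;
  if (!Number.isInteger(mode) || mode < PSM_MIN || mode > PSM_MAX) {
    throw new RangeError(
      `PSM must be an integer from ${PSM_MIN} to ${PSM_MAX}, got ${String(psm)}`
    );
  }
  return ["--psm", String(mode)];
}
```

The returned array can be appended to whatever argument list the plugin already builds for Tesseract, for example `[...existingArgs, ...psmArgs(6)]`, where `existingArgs` stands in for the plugin's own arguments.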

@danpf changed the title from "[WIP] Add configs for DPI + copyattachments" to "Add configs for DPI, Page Segmentation Mode, and Zotero non-linked attachments" on Mar 11, 2023