PDF Document does not display greek characters in text field correctly when rendered in pdf.js #17958

Open
gwtdevlpr opened this issue Apr 17, 2024 · 6 comments

@gwtdevlpr

Attached PDF file:

Configuration:

  • Web browser and its version: Mozilla Firefox, any version
  • Operating system and its version: Any OS; tested on macOS
  • PDF.js version: v4.1.392 (latest)
  • Is a browser extension:

Steps to reproduce the problem:

  1. Open the attached PDF in https://mozilla.github.io/pdf.js/web/viewer.html
  2. The text fields show junk characters instead of the Greek text

What is the expected behavior? (add screenshot)
Display proper Greek characters (screenshot of the same file opened in Chrome)

What went wrong? (add screenshot)
Junk characters are shown instead of proper Greek characters in text fields

Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension):

@choradt

choradt commented Apr 17, 2024

Attaching a sample document.
Copy.pdf

Working and non-working screenshots of the rendered document (Chrome and Firefox)
scrshot

@calixteman
Contributor

When clicking on one of the fields, the value should be correct.
That said, when rendering a field, we extract the text from its appearance to display it in the input, but when the field is focused we use its value (from the V entry). Here the appearance uses a font with an Identity-H encoding, which is why we extract a wrong string.
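
To illustrate why the extracted string comes out wrong, here is a minimal sketch. It is not pdf.js code: `decodeIdentityH`, the byte values, and the `toUnicode` map are all made up for the example. With Identity-H, each 2-byte code in the appearance text is a CID rather than a Unicode value, so without a ToUnicode map the only fallback is to treat the code as a character code.

```javascript
// Hypothetical illustration, not pdf.js code.
// With Identity-H, every 2-byte code is a CID, not a Unicode code point.
function decodeIdentityH(bytes, toUnicode) {
  let out = "";
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    const cid = (bytes[i] << 8) | bytes[i + 1];
    // Without a ToUnicode map, fall back to using the CID as a char code,
    // which is generally wrong.
    out += toUnicode?.get(cid) ?? String.fromCharCode(cid);
  }
  return out;
}

// Made-up CIDs standing in for the Greek letters "αβγ".
const tjBytes = Uint8Array.from([0x00, 0x24, 0x00, 0x25, 0x00, 0x26]);
const toUnicode = new Map([
  [0x24, "α"],
  [0x25, "β"],
  [0x26, "γ"],
]);

const withMap = decodeIdentityH(tjBytes, toUnicode); // "αβγ"
const withoutMap = decodeIdentityH(tjBytes, null); // "$%&"
```

The V entry, by contrast, stores the field value as a text string, which is why the focused field shows the right characters.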

@gwtdevlpr
Author

When I click a field it shows the correct value, but when I move away it goes back to junk characters. Could it be fixed to show the right value all the time?

@choradt

choradt commented Apr 17, 2024

When pdf.js is used to render a PDF inside a viewer, fields are not clickable.

But even if they were, no one could expect a user to click on 40 or 50 fields in a form just to be able to read them.

This must be solved in a way that the characters display correctly without clicking inside the field.

@calixteman
Contributor

@Snuffleupagus, do you know what the exact criteria would be to determine that we won't be able to guess the Unicode string from the string in Tj and the font properties?
I guess a missing ToUnicode doesn't help, but there are likely some cases (for example with basic ASCII strings) where it's possible to guess even without a ToUnicode.
My idea is to use this info when we're getting the text from the appearance:
https://github.com/mozilla/pdf.js/blob/master/src/core/annotation.js#L1234-L1243
so that we don't use the extracted text when it can't be trusted.
In the PDF here, the font has a missing file, no ToUnicode, and a cidEncoding set to Identity-H.
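
A rough sketch of what such a check could look like, using only the three conditions seen in this PDF. The helper names and the shape of the `font` object (`missingFile`, `toUnicode`, `cidEncoding`) are assumptions for illustration, not the actual pdf.js code, and the real criteria may need to be broader or narrower.

```javascript
// Hypothetical helpers, not part of pdf.js. The property names on `font`
// are assumed for this example.
function canTrustAppearanceText(font) {
  if (font.toUnicode) {
    // An explicit mapping to Unicode exists.
    return true;
  }
  // No ToUnicode, no font file, and an identity CID encoding:
  // the codes in Tj cannot be mapped back to Unicode.
  return !(font.missingFile && font.cidEncoding === "Identity-H");
}

// Use the text extracted from the appearance only when it can be trusted;
// otherwise fall back to the field value (the V entry).
function pickDisplayValue(appearanceText, fieldValue, font) {
  return canTrustAppearanceText(font) ? appearanceText : fieldValue;
}

// The font described in this issue.
const issueFont = { missingFile: true, toUnicode: null, cidEncoding: "Identity-H" };
const shown = pickDisplayValue("$%&", "αβγ", issueFont); // "αβγ"
```

Fonts without a ToUnicode but with a usable font file or a non-identity encoding still pass this check, which keeps the basic ASCII cases working.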

@gwtdevlpr

This comment was marked as duplicate.
