Add test cases for AGL. #101

dannywinrow · 2022-01-15T20:00:45Z

I tried to run pdPageExtractText on the PDF located at:
https://www.gov.im/media/1360682/isle-of-man-inflation-report-november-2021.pdf

However, every character of the text was being interpreted as "\0".

After much pain and effort trawling through the PDFIO code, I have identified the problem as being what is returned by the fum function in PDFont, in particular when the cn"Encoding" object contains a /Differences object with values such as /uni0047, which just represents the Unicode character U+0047 ('G'). Since the AGL_Glyph_To_Unicode dictionary (not sure where this comes from) doesn't contain the simple Unicode mappings, zero(Char) is returned instead.
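To make the failure mode concrete, here is a minimal sketch in Python (not PDFIO's Julia code). It expands a /Differences array into a code-to-glyph-name map and then does a dictionary-only lookup that falls back to "\0". The names `differences_to_names`, `naive_lookup` and `AGL_SAMPLE` are hypothetical, and the three-entry dictionary is only a stand-in for the real glyph list.

```python
# Hypothetical stand-in for the glyph-name dictionary; the real list is far larger.
AGL_SAMPLE = {"space": " ", "G": "G"}


def differences_to_names(differences):
    """Expand a PDF /Differences array into {character code: glyph name}.

    The array is a sequence of integers and names: each integer sets the
    current code, and every following name takes the next consecutive code.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = item
            code += 1
    return mapping


def naive_lookup(names, glyph_dict):
    """Dictionary-only lookup: any name missing from the dictionary becomes NUL."""
    return {code: glyph_dict.get(name, "\0") for code, name in names.items()}


# /Differences [32 /space 71 /uni0047 /G], written without the leading slashes.
names = differences_to_names([32, "space", 71, "uni0047", "G"])
decoded = naive_lookup(names, AGL_SAMPLE)
```

In this sketch, code 71 decodes to "\0" because "uni0047" is not a key in the dictionary, while code 72 ("G") decodes normally. That matches the symptom described above.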

One solution might be to check the hex part of a name such as /uni0047 against the base encoding dictionary and, if 0x0047 exists there, add a dictionary entry for it. Another solution would be to add all of the standard Unicode characters that already exist in your base encoding, such as /uni0047, to the AGL_Glyph_To_Unicode dictionary.

I have made the assumption, when suggesting this solution, that the cn"Encoding" object is taken directly from the pdf file and not further processed.

If you'd like me to try to create a pull request, I'd be happy to, but I thought I'd ask first in case your more holistic view of the project leads to a more effective solution.

dannywinrow · 2022-01-16T02:52:36Z

I have updated this issue, since I think I have found the crux of the problem: PDFIO is missing the part of the AGL specification which states that you first match against the AGL and, if there is no match, you then test whether the name is a Unicode character of the form uniXXXX or uXXXX (see the specification for the general case and restrictions).
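The lookup order described above can be sketched roughly as follows. This is a Python illustration based on my reading of the AGL specification, not the code from any PDFIO commit. The function names and the tiny `AGL_SAMPLE` dictionary are hypothetical. The restrictions assumed here are: uppercase hex digits only, `uni` followed by one or more groups of four digits, `u` followed by four to six digits, and no surrogate code points.

```python
import re

# Hypothetical stand-in for the Adobe Glyph List; the real list is far larger.
AGL_SAMPLE = {"space": "\u0020", "A": "A", "eacute": "\u00e9"}

_UNI = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uniXXXX, uniXXXXYYYY, ...
_U = re.compile(r"u([0-9A-F]{4,6})")         # uXXXX to uXXXXXX


def _is_scalar(cp):
    """True for code points outside the surrogate range D800-DFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def component_to_text(name, agl=AGL_SAMPLE):
    """Map one glyph-name component: glyph list first, then uni/u forms."""
    if name in agl:
        return agl[name]
    m = _UNI.fullmatch(name)
    if m:
        digits = m.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_scalar(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return ""
    m = _U.fullmatch(name)
    if m:
        cp = int(m.group(1), 16)
        return chr(cp) if _is_scalar(cp) else ""
    return ""  # unknown name: no mapping


def glyph_name_to_text(glyph_name, agl=AGL_SAMPLE):
    """Drop any suffix after the first '.', split on '_', map each component."""
    base = glyph_name.split(".", 1)[0]
    return "".join(component_to_text(part, agl) for part in base.split("_"))
```

Assertions in the same spirit as the requested test cases might check that `glyph_name_to_text("uni0047")` gives `"G"`, that `"u1F600"` gives the single character U+1F600, and that a surrogate such as `"uniD800"` gives an empty string rather than a bogus character.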

sambitdash · 2022-11-20T16:32:09Z

6367aa6 fixes it, but no test cases have been added as the file is no longer accessible.

sambitdash · 2022-11-20T16:32:30Z

Add test cases for AGL.

sambitdash · 2022-11-22T19:29:29Z

isle-of-man-inflation-report-november-2021.pdf
Adding a copy of the file, which I got by Googling. But this version does not have an AGL code. The suggested file is no longer on the site. We need to look for a better test file.

dannywinrow mentioned this issue Jan 15, 2022

Fix for font unicode character map where glyph is of the format /uniXXXX #102

Closed

dannywinrow changed the title ~~TrueType font not working when differences contains unicode glyphs e.g. /uni0047~~ Part of AGL specification not implemented Jan 16, 2022

sambitdash closed this as completed Nov 20, 2022

sambitdash reopened this Nov 20, 2022

sambitdash changed the title ~~Part of AGL specification not implemented~~ Add test cases for AGL. Nov 20, 2022
