Gibberish in output #489

Open
kevinburke opened this issue Aug 15, 2022 · 2 comments


@kevinburke

I'm using Tabula for Mac. We are trying to export the tables in the attached PDF.
concord_housing_table.pdf

The initial upload generated a lot of overlapping selections. We removed all of them except for the selections that covered the entire table row.

When we go to export, the output looks like complete gibberish:

Export Data | Tabula 2022-08-15 11-07-17

We're confused about this, because the gibberish is clearly structured: the number of gibberish characters matches the text in the original file. Maybe we missed an encoding setting? We looked through the tools in the app but didn't find anything relevant.

@jeremybmerrill
Member

Hi @kevinburke, nice to see you here :)

This is almost certainly an issue with how pdfbox, the library Tabula uses to interact with the PDF at a low level, handles PDFs generated in unusual ways. The best fix is to re-encode the PDF with pdftk, Acrobat, or a tool of your choice. That generally fixes things.
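
For the pdftk route, a minimal sketch of the re-encoding step in Python might look like the following. It assumes `pdftk` is installed and on your `PATH`; the helper names `build_pdftk_command` and `reencode` are made up for illustration, and rewriting the file this way is not guaranteed to fix every PDF.

```python
import shutil
import subprocess


def build_pdftk_command(src, dst):
    """Build the pdftk invocation that reads `src` and writes a fresh copy to `dst`."""
    # "pdftk <input> output <output>" is pdftk's basic read-and-rewrite form.
    return ["pdftk", src, "output", dst]


def reencode(src, dst):
    """Rewrite `src` to `dst` with pdftk (hypothetical helper, not part of Tabula)."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(build_pdftk_command(src, dst), check=True)
    return dst
```

You would then call something like `reencode("concord_housing_table.pdf", "reencoded.pdf")` (the output name is arbitrary) and load the rewritten file into Tabula.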

@jazzido
Contributor

jazzido commented Aug 15, 2022

It could also be a subsetted font, which essentially amounts to a non-standard encoding. See this StackOverflow answer.
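
To illustrate why a subsetted font can produce gibberish with the same length as the real text, here is a toy model in Python. It is only a sketch under simplifying assumptions: the code numbering scheme, the function names, and the sample string are all invented, and real PDF fonts and pdfbox's handling of them are more involved.

```python
def build_subset_encoding(text):
    """Give each distinct character a code in order of first appearance.

    Assumption: this stands in for how a subsetting tool might number
    the glyphs it keeps; real tools may number them differently.
    """
    encoding = {}
    for ch in text:
        if ch not in encoding:
            encoding[ch] = len(encoding) + 1  # codes start at 1
    return encoding


def encode(text, encoding):
    """Store the text as a list of subset codes, one per character."""
    return [encoding[ch] for ch in text]


def extract_without_map(codes, base=0x20):
    """Naive extraction: read each code as if it were a standard character code."""
    return "".join(chr(base + code) for code in codes)


def extract_with_map(codes, encoding):
    """Extraction with the code-to-character map available."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse[code] for code in codes)


original = "Total units 120"  # made-up sample text
subset = build_subset_encoding(original)
codes = encode(original, subset)

garbled = extract_without_map(codes)
recovered = extract_with_map(codes, subset)

print(repr(garbled))
print(repr(recovered))
```

In this model the garbled string has exactly one character per original character, and repeated letters map to repeated symbols, which matches the "meaningful gibberish" described in the report. The text is only recoverable when the mapping from codes back to characters is available.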
