Regression against v1.0.2: scientific notation and text element positioning #526

Open
AdamLLogan opened this issue Jun 5, 2023 · 7 comments

@AdamLLogan

The GUI/web app version of Tabula works (almost) perfectly to grab the tables I need (from a scientific article). However, I am trying to create an automated system, and the command-line version cannot read certain characters correctly and merges columns. These errors occur with the same table that the GUI version handles perfectly. I am feeding the command-line version the same area that the GUI version is analyzing.
This is the GUI output:
[image: gui_version]
This is the command-line output:
[image: command_line_version]

@jeremybmerrill
Member

The character encoding problems are an Excel issue, not a Tabula issue. For the merged columns, you may need to explicitly specify an extraction method and/or make sure your extraction regions are identical to those in the GUI version, for instance by using the GUI's bash script output.
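
A minimal sketch of pinning both the extraction method and the region in an automated pipeline, written in Python around the tabula-java CLI. The jar path, PDF name, page, and coordinates below are placeholders (assumptions, not values from this issue). The flags used (`--stream`/`--lattice`, `--pages`, `--area`, `--format`, `--outfile`) are tabula-java command-line options, with `--area` given as top,left,bottom,right in points.

```python
import subprocess


def build_tabula_command(jar, pdf, page, area, method="stream", outfile="out.csv"):
    """Build a tabula-java command line with an explicit method and area.

    area is (top, left, bottom, right) in PDF points.
    """
    if method not in ("stream", "lattice"):
        raise ValueError("method must be 'stream' or 'lattice'")
    top, left, bottom, right = area
    return [
        "java", "-jar", jar,
        f"--{method}",
        "--pages", str(page),
        "--area", f"{top},{left},{bottom},{right}",
        "--format", "CSV",
        "--outfile", outfile,
        pdf,
    ]


# Hypothetical jar name, file name, page, and coordinates.
cmd = build_tabula_command(
    "tabula-java.jar", "article.pdf", 2, (100.0, 50.0, 400.0, 550.0)
)
print(" ".join(cmd))
# To actually run the extraction: subprocess.run(cmd, check=True)
```

Building the argument list in one place makes it easy to confirm that the automated run uses the same method and area as the GUI selection.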

@AdamLLogan
Author

I should have clarified more. I am using the stream method in both cases; the lattice method outputs no data. Also, both are indeed using exactly the same areas. As for the characters, I am not worried about how Excel parses the data; sorry for the miscommunication. I am more concerned that tabula-java is not picking up the scientific notation fully and is merging columns, as I mentioned.

@jeremybmerrill
Member

At the end of the day, the GUI is a front-end for the command-line version. (And the CLI version exists for automated pipelines, which I've implemented many of. So, this oughta work.)

I can't really offer much of a theory on the combined columns without seeing the PDF (at least a screenshot of the table), but in general with the stream method, columns get combined when there's some text that spans the two columns. Often headers are the culprit (or footnotes). You might try fuzzing the coordinates a little bit to see if something's being erroneously included.
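
One way to fuzz the coordinates systematically is to nudge each edge of the selection by a small amount and re-run the extraction for every variant. The sketch below only generates the candidate areas; the starting area and the 2-point step are made-up values for illustration, and `fuzz_area` is a hypothetical helper, not part of Tabula.

```python
from itertools import product


def fuzz_area(area, delta=2.0):
    """Yield variants of (top, left, bottom, right) with each edge
    shifted by -delta, 0, or +delta points."""
    for offsets in product((-delta, 0.0, delta), repeat=4):
        yield tuple(edge + off for edge, off in zip(area, offsets))


# Hypothetical starting selection, in PDF points.
variants = list(fuzz_area((100.0, 50.0, 400.0, 550.0)))
print(len(variants))  # 3 choices per edge, 4 edges: 81 variants
```

Each variant can then be passed to the CLI's area option, and the resulting column counts compared to spot a stray header or footnote being pulled into the region.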

Scientific notation, I don't have any theories, again without seeing more. Have you verified by opening the CSV in a text editor (or Google Sheets, which copes with Unicode in CSVs better than Excel (or, to be precise, Mac Excel)) that the characters are really absent?
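
To check without any spreadsheet program in the way, the CSV can be read directly and every non-ASCII character listed by its Unicode name. This is a rough sketch using only the Python standard library; the sample row is invented for illustration and is not data from the PDF in question.

```python
import csv
import io
import unicodedata


def non_ascii_cells(csv_text):
    """Return (row, col, cell, [character names]) for every cell that
    contains at least one non-ASCII character."""
    found = []
    for r, row in enumerate(csv.reader(io.StringIO(csv_text))):
        for c, cell in enumerate(row):
            names = [unicodedata.name(ch, "UNKNOWN") for ch in cell if ord(ch) > 127]
            if names:
                found.append((r, c, cell, names))
    return found


# Made-up sample; for a real file use
# open(path, encoding="utf-8", newline="").read()
sample = "alloy,i_corr\nCo0.5,1.2 × 10⁻⁶\n"
for hit in non_ascii_cells(sample):
    print(hit)
```

If the multiplication sign or superscript characters appear in one output's report and not in the other's, they are missing from the file itself rather than being hidden by the viewer.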

@AdamLLogan
Author

AdamLLogan commented Jun 6, 2023

Here is a screenshot of the table I am trying to analyze:
[image: table_example]
And this is the selection I am using for Tabula GUI and CLI:
[image: table_selection]
I rechecked both of the outputs using Google Sheets instead and can confirm that part of the scientific notation is missing from the CLI output.
GUI Output:
[image: google_sheets_gui]
CLI Output:
[image: google_sheets_cml]

Before I opened this issue, I tinkered with the coordinates in the hopes of fixing the output but to no avail. Thanks a lot for assisting with this.

@jeremybmerrill
Member

I'm pretty puzzled. Maybe try the previous tabula-java version, 1.0.4? https://github.com/tabulapdf/tabula-java/releases/tag/v1.0.4 or even v1.0.2, which appears to be the version used in the GUI. It's possible there was a regression.

@AdamLLogan
Author

v1.0.2 worked perfectly (just like the GUI version output). v1.0.3 and v1.0.4 have the same undesirable behavior and output as v1.0.5. No idea what's causing this, but thanks for helping me find a workaround. I cannot share the PDF of the article directly, but if you want to look into this further the article is titled "Corrosion behavior of CoxCrCuFeMnNi high-entropy alloys prepared by hot pressing sintered in 3.5% NaCl solution" accessed via ScienceDirect. The table is on page 2.
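
To pin down exactly where the v1.0.2 output and a later version's output diverge, the two CSVs can be compared cell by cell. The sketch below is one possible way to do that with the Python standard library; `diff_csv` is a hypothetical helper and the two sample strings are invented stand-ins for the real outputs.

```python
import csv
import io
from itertools import zip_longest


def diff_csv(expected_text, actual_text):
    """Return (row, col, expected, actual) for each differing cell.
    A cell missing on one side is reported as None."""
    expected = list(csv.reader(io.StringIO(expected_text)))
    actual = list(csv.reader(io.StringIO(actual_text)))
    diffs = []
    for r, (erow, arow) in enumerate(zip_longest(expected, actual, fillvalue=[])):
        for c, (e, a) in enumerate(zip_longest(erow, arow, fillvalue=None)):
            if e != a:
                diffs.append((r, c, e, a))
    return diffs


# Invented example: the second output has two columns merged into one.
good = "a,b,c\n1,2,3\n"
bad = "a,b c\n1,2 3\n"
diffs = diff_csv(good, bad)
for d in diffs:
    print(d)
```

An empty result means the two versions agree; merged columns show up as a changed cell followed by a missing one in the same row.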

@jeremybmerrill
Member

Glad v1.0.2 worked. I'm surprised to see this regression.

I'm going to retitle this ticket to, at least, eventually theoretically hopefully maybe make this a test case in the test suite.
1-s2.0-S221137971932090X-main.pdf

@jazzido curious if you have a sense if this is due to upgrading the PDFBox version?

@jeremybmerrill jeremybmerrill reopened this Jun 8, 2023
@jeremybmerrill jeremybmerrill changed the title GUI version providing desired output while command-line version does not Regression against v1.0.2: scientific notation and text element positioning Jun 8, 2023